Simulate the sampling distribution of estimates from a population
Source:R/diagnose.R
      sampling_distribution.RdRepeatedly refits the model to new samples from the population, calculates estimates for each fit, and compiles a data frame of the results.
Arguments
- fit
 A model fit to data, such as by
lm()orglm(), to refit to each sample from the population.- data
 Data drawn from a
population(), usingsample_x()and possiblysample_y(). Thepopulation()specification is used to draw the samples.- fn
 Function to call on each new model fit to produce a data frame of estimates. Defaults to
broom::tidy(), which produces a tidy data frame of coefficients, estimates, standard errors, and hypothesis tests.- nsim
 Number of simulations to run.
- fixed_x
 If
TRUE, the default, the predictor variables are held fixed and only the response variables are redrawn from the population. IfFALSE, the predictor and response variables are drawn jointly.- ...
 Additional arguments passed to
fneach time it is called.
Value
Data frame (tibble) of nsim + 1 simulation results, formed by
concatenating together the data frames returned by fn. The .sample
column identifies which simulated sample each row came from. Rows with
.sample == 0 come from the original fit.
Details
To generate sampling distributions of different quantities, the user can
provide a custom fn. The fn should take a model fit as its argument and
return a data frame. For instance, the data frame might contain one row per
estimated coefficient and include the coefficient and its standard error; or
it might contain only one row of model summary statistics.
Model limitations
Because this function uses S3 generic methods such as simulate() and
update(), it can be used with any model fit for which methods are provided.
In base R, this includes lm() and glm().
The model provided as fit must be fit using the data argument to provide
a data frame. For example:
When simulating new data, this function provides the simulated data as the
data argument and re-fits the model. If you instead refer directly to local
variables in the model formula, this will not work. For example, if you fit a
model this way:
It will not be possible to refit the model using simulated datasets, as that
would require modifying your environment to edit cars.
See also
parametric_boot_distribution() to simulate draws from a fitted
model, rather than from the population
Examples
pop <- population(
  x1 = predictor(rnorm, mean = 4, sd = 10),
  x2 = predictor(runif, min = 0, max = 10),
  y = response(0.7 + 2.2 * x1 - 0.2 * x2, error_scale = 4.0)
)
d <- sample_x(pop, n = 20) |>
  sample_y()
fit <- lm(y ~ x1 + x2, data = d)
# using the default fn = broom::tidy(). conf.int argument is passed to
# broom::tidy()
samples <- sampling_distribution(fit, d, conf.int = TRUE)
samples
#> # A tibble: 303 × 8
#>    term        estimate std.error statistic  p.value conf.low conf.high .sample
#>    <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>   <dbl>
#>  1 (Intercept)    1.83     2.01       0.907 3.77e- 1   -2.42      6.07        0
#>  2 x1             2.24     0.100     22.3   4.99e-14    2.03      2.45        0
#>  3 x2            -0.479    0.350     -1.37  1.88e- 1   -1.22      0.258       0
#>  4 (Intercept)   -0.333    1.58      -0.210 8.36e- 1   -3.67      3.01        1
#>  5 x1             2.23     0.0790    28.3   9.95e-16    2.06      2.40        1
#>  6 x2            -0.191    0.275     -0.695 4.97e- 1   -0.771     0.389       1
#>  7 (Intercept)    1.67     2.24       0.743 4.68e- 1   -3.07      6.40        2
#>  8 x1             2.17     0.112     19.4   5.01e-13    1.93      2.40        2
#>  9 x2            -0.493    0.389     -1.27  2.23e- 1   -1.31      0.328       2
#> 10 (Intercept)    1.41     1.50       0.944 3.59e- 1   -1.75      4.57        3
#> # ℹ 293 more rows
suppressMessages(library(dplyr))
# the model is correctly specified, so the estimates are unbiased:
samples |>
  group_by(term) |>
  summarize(mean = mean(estimate),
            sd = sd(estimate))
#> # A tibble: 3 × 3
#>   term          mean     sd
#>   <chr>        <dbl>  <dbl>
#> 1 (Intercept)  0.658 1.73  
#> 2 x1           2.19  0.0790
#> 3 x2          -0.172 0.279 
# instead of coefficients, get the sampling distribution of R^2
rsquared <- function(fit) {
  data.frame(r2 = summary(fit)$r.squared)
}
sampling_distribution(fit, d, rsquared, nsim = 10)
#> # A tibble: 11 × 2
#>       r2 .sample
#>    <dbl>   <dbl>
#>  1 0.967       0
#>  2 0.980       1
#>  3 0.971       2
#>  4 0.965       3
#>  5 0.980       4
#>  6 0.977       5
#>  7 0.986       6
#>  8 0.990       7
#>  9 0.960       8
#> 10 0.968       9
#> 11 0.983      10