6 Linear Models in R

\[ \DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator{\R}{\mathbb{R}} \DeclareMathOperator{\var}{Var} \newcommand{\X}{\mathbf{X}} \]

For these examples, we’ll use the mtcars data built into R. According to the documentation, the data frame comes from Motor Trend magazine in 1974, and contains features of performance and design for 32 cars. Here are the first few rows:

 | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
Type ?mtcars
in the R console to see documentation on the meaning of each variable.
We will also be using the broom package to convert model information to tidy (data frame) format.
library(broom)
6.1 Generic functions in R
This book does not contain an introduction to R, and I won’t describe R syntax here. (For that, see Wickham, Çetinkaya-Rundel, and Grolemund (2023).) But there is one lesser-known R feature, one that most R users are only vaguely aware of, that is essential to understanding how modeling actually works: S3 generics.
The S3 system tries to solve a common problem: it is helpful to have functions that can work on many different types of arguments in consistent ways. It is also useful to be able to extend a function to handle a new argument type without having to modify its source code directly.
Consider a simple example: the print() function. You’ve probably used it to print various types of objects in R:

foo <- "ducks"
print(foo)

[1] "ducks"
bar <- data.frame(a = 1:2, b = 3:4)
print(bar)
a b
1 1 3
2 2 4
Suppose I write a function that creates a new type of data: perhaps a new kind of model fit, or maybe it allows me to load social network data, or maybe it generates graphics. How can I make print() work on this without having to find the R source code and edit print()?
Fortunately, if we examine the source code of print(), we see part of the answer:
print
function (x, ...)
UseMethod("print")
<bytecode: 0x7fc03604bab8>
<environment: namespace:base>
print() is a generic function. It calls UseMethod("print"), and UseMethod() is part of the S3 generic system. It looks at the class of the first argument to the function, where the “class” is a specific attribute of the variable:
class(foo)
[1] "character"
class(bar)
[1] "data.frame"
It then calls a function print.[class]()
according to the class. These class-specific print
functions are called methods. For instance, R has a print.data.frame()
method, and in fact nearly 200 other print()
methods:
head(methods(print), n = 10)
[1] "print.acf" "print.activeConcordance"
[3] "print.AES" "print.all_vars"
[5] "print.anova" "print.any_vars"
[7] "print.aov" "print.aovlist"
[9] "print.ar" "print.Arima"
Many functions in R are S3 generic functions, and R provides methods for many of its built-in classes. For regression modeling, this is useful because there are generic methods for getting model coefficients, making confidence intervals, making predictions, and so on, and there are methods for many kinds of regression models. You can easily write code that works with any kind of model fit just by calling generic methods, instead of having one function that works on linear models and a different function that works on logistic regression fits.
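For example, here is a minimal sketch (describe_fit() is a name invented here, not a standard function) of code that works for any model class providing coef() and confint() methods:

# Summarize any fit whose class has coef() and confint() methods;
# the generics dispatch to the right method for lm and other model types
describe_fit <- function(model) {
  est <- coef(model)
  ci <- confint(model)
  data.frame(term = names(est), estimate = est,
             lower = ci[, 1], upper = ci[, 2])
}

describe_fit(lm(mpg ~ wt, data = mtcars))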
Individual methods often have their own documentation. For instance, ?residuals gives the generic documentation for residuals(), but ?residuals.lm tells you specifically what it does for linear model fits.
For more details on the S3 system and how to create your own generic functions, see Wickham (2019) chapter 13.
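As a small taste, here is a hypothetical sketch (the duck_count class and its constructor are invented for illustration, not part of base R): to make print() work with your own objects, you define a function named print.<class>, and UseMethod() finds it automatically.

# Hypothetical example: a tiny "duck_count" class with its own print() method
new_duck_count <- function(n) structure(list(n = n), class = "duck_count")

# Because print() is generic, defining print.duck_count() is all that's needed;
# UseMethod("print") dispatches to it for any object of class "duck_count"
print.duck_count <- function(x, ...) {
  cat("A flock of", x$n, "ducks\n")
  invisible(x)
}

print(new_duck_count(7))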
6.2 Fitting
The lm()
function fits linear models using least squares. It takes two key arguments:
- A model formula, which is special syntax for specifying the names of the response and predictor variables, and any transformations, interactions, or basis expansions that might be used.
- A data frame providing the observed data, which must contain columns whose names match the terms in the model formula.
R uses the model formula and data frame to construct the design matrix; you can also use the model.matrix()
function to get the design matrix R would use for a given model formula and data frame.
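For instance, a quick look (output not shown here) at the design matrix for a simple two-predictor formula:

# The design matrix: an intercept column of 1s plus the disp and hp columns
head(model.matrix(mpg ~ disp + hp, data = mtcars))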
The lm()
function returns a fit object containing the data, fit parameters, estimates, and various other useful pieces. For example,
fit <- lm(mpg ~ disp + hp, data = mtcars)
fit
Call:
lm(formula = mpg ~ disp + hp, data = mtcars)
Coefficients:
(Intercept) disp hp
30.73590 -0.03035 -0.02484
The variable fit
is now an object with class lm.
Model formulas place the response variable to the left of the ~ and predictors to the right, separated by + signs. Formulas can contain transformations and some useful functions:

- mpg ~ log(disp) + hp log-transforms disp.
- mpg ~ I(disp^2) + hp takes the square of disp. I() is necessary because the ^ operator has a specific meaning in formulas (below), so I() tells R to ignore this and evaluate it as-is.
- mpg ~ disp + hp - 1 instructs lm() not to insert an intercept automatically.
- The : operator represents interaction terms, so mpg ~ disp + hp + disp:hp represents a model with linear main effects for each predictor and the interaction between them.
- The * shorthand includes both main effects and the interaction, so mpg ~ disp * hp is equivalent to the previous formula.
- The ^ operator calculates interactions up to a specific degree. For instance, mpg ~ (disp + hp + wt)^3 includes the three main effects, all two-way interactions, and the three-way interaction.
See the help for ?formula
for more details. We’ll discuss interactions in more detail in Chapter 7.
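As a quick check of the interaction shorthands above (fit_a and fit_b are names chosen here for illustration), the : and * forms produce identical fits:

# These two formulas specify the same model, so the coefficients match
fit_a <- lm(mpg ~ disp + hp + disp:hp, data = mtcars)
fit_b <- lm(mpg ~ disp * hp, data = mtcars)
all.equal(coef(fit_a), coef(fit_b))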
Exercise 6.1 (Model matrices) The model.matrix()
function takes two arguments, a formula and a data frame, just like lm()
, and produces the design matrix \(\X\) corresponding to that model formula. Use this function and the mtcars
data to examine the model matrices produced by the formulas listed above.
6.3 Coefficients and fitted values
The coef()
generic function fetches model coefficients as a named vector, so we can index it by name:
coef(fit)
(Intercept) disp hp
30.73590425 -0.03034628 -0.02484008
coef(fit)["disp"]
disp
-0.03034628
For a tidier version as a data frame, tidy()
from the broom package is very useful:
tidy(fit)
# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 30.7 1.33 23.1 3.26e-20
2 disp -0.0303 0.00740 -4.10 3.06e- 4
3 hp -0.0248 0.0134 -1.86 7.37e- 2
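Because tidy() returns an ordinary data frame (a tibble), it is easy to work with programmatically; for example, a small sketch pulling out a single row:

# Extract just the row for the disp coefficient
subset(tidy(fit), term == "disp")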
To display model fits in reports (in Quarto or R Markdown), don’t print the output of summary()
or coef()
into the report. Make a formatted table. A simple option is to pass the output of tidy()
to knitr::kable()
, which turns data frames into simple tables:
tidy(fit) |>
  knitr::kable(col.names = c("Term", "Estimate", "SE", "t", "p"),
               digits = 3)
Term | Estimate | SE | t | p |
---|---|---|---|---|
(Intercept) | 30.736 | 1.332 | 23.083 | 0.000 |
disp | -0.030 | 0.007 | -4.098 | 0.000 |
hp | -0.025 | 0.013 | -1.856 | 0.074 |
For a nicer table, use the gtsummary package’s tbl_regression()
, which provides confidence intervals and has better formatting for factors and interactions:
library(gtsummary)
tbl_regression(fit, intercept = TRUE)
Characteristic | Beta | 95% CI¹ | p-value |
---|---|---|---|
(Intercept) | 31 | 28, 33 | <0.001 |
disp | -0.03 | -0.05, -0.02 | <0.001 |
hp | -0.02 | -0.05, 0.00 | 0.074 |

¹ CI = Confidence Interval
6.4 Residuals and diagnostics
The vcov()
function extracts estimates of \(\var(\hat \beta)\) as a matrix, assuming the errors are independent and have equal variance (Theorem 5.3):
vcov(fit)
(Intercept) disp hp
(Intercept) 1.773068357 -0.0011510577 -0.0081943285
disp -0.001151058 0.0000548319 -0.0000783970
hp -0.008194329 -0.0000783970 0.0001791716
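The standard errors reported by tidy() and summary() are the square roots of the diagonal of this matrix; as a quick sketch:

# Coefficient standard errors from the estimated covariance matrix;
# these match the std.error column of tidy(fit)
sqrt(diag(vcov(fit)))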
The fitted() function gives the fitted values \(\hat Y\) as a vector. If the data has row names, this will be a named vector.
head(fitted(fit))
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
23.14809 23.14809 25.14838 20.17416
Hornet Sportabout Valiant
15.46423 21.29978
The residuals()
generic function fetches the residuals from a model fit. It can return different types of residuals, so see ?residuals.lm
for details; the default residuals are simply \(Y - \hat Y\).
head(residuals(fit))
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
-2.148091 -2.148091 -2.348379 1.225844
Hornet Sportabout Valiant
3.235770 -3.199783
There are a variety of similar functions for different diagnostics:

- rstandard() for standardized residuals
- rstudent() for Studentized residuals
- cooks.distance() for Cook’s distances
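For example, a small sketch flagging potentially influential observations (the 4/n cutoff is a common rule of thumb, not something derived in this chapter):

# Cook's distance for each car, and the cars above a rough 4/n threshold
cd <- cooks.distance(fit)
head(sort(cd, decreasing = TRUE), 3)
names(cd)[cd > 4 / nrow(mtcars)]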
See Section 5.5 for details on each kind of residual. The augment() function from broom can put many of these diagnostics in a single data frame along with the original data. For example, Table 6.1 shows the augmented version of fit. Notice the new columns have names beginning with a period (.), so they cannot accidentally have the same name as a variable in your model. The data frame contains the fitted values, residuals, Cook’s distances, standardized residuals, and several other values; see ?augment.lm for details.
Table 6.1: augment(fit), showing the extra columns added by broom::augment().
.rownames | mpg | disp | hp | .fitted | .resid | .hat | .sigma | .cooksd | .std.resid |
---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 160 | 110 | 23.1 | -2.1 | 0.0 | 3.2 | 0 | -0.7 |
Mazda RX4 Wag | 21.0 | 160 | 110 | 23.1 | -2.1 | 0.0 | 3.2 | 0 | -0.7 |
Datsun 710 | 22.8 | 108 | 93 | 25.1 | -2.3 | 0.1 | 3.1 | 0 | -0.8 |
Hornet 4 Drive | 21.4 | 258 | 110 | 20.2 | 1.2 | 0.1 | 3.2 | 0 | 0.4 |
Hornet Sportabout | 18.7 | 360 | 175 | 15.5 | 3.2 | 0.1 | 3.1 | 0 | 1.1 |
Valiant | 18.1 | 225 | 105 | 21.3 | -3.2 | 0.1 | 3.1 | 0 | -1.1 |
6.5 Confidence intervals and model inference
To get 95% confidence intervals for model coefficients, we can use confint()
, which uses the normality assumption and \(t\) distribution of \(\hat \beta\). This returns a matrix, not a data frame, though it can be indexed by row and column name.
confint(fit)
2.5 % 97.5 %
(Intercept) 28.01254573 33.459262767
disp -0.04549091 -0.015201645
hp -0.05221650 0.002536338
confint(fit)["disp", "2.5 %"]
[1] -0.04549091
Optional arguments allow you to specify individual parameters you want intervals for, or to set the confidence level to something other than 95%. Alternatively, broom’s tidy() can be asked to include confidence intervals by passing the conf.int = TRUE argument, as shown in Table 6.2.
Table 6.2: tidy(fit, conf.int = TRUE), showing the extra conf.low and conf.high columns.
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 30.736 | 1.332 | 23.083 | 0.000 | 28.013 | 33.459 |
disp | -0.030 | 0.007 | -4.098 | 0.000 | -0.045 | -0.015 |
hp | -0.025 | 0.013 | -1.856 | 0.074 | -0.052 | 0.003 |
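For instance, a brief sketch of the optional confint() arguments mentioned above, requesting a 99% interval for a single coefficient:

# 99% confidence interval for the disp coefficient only
confint(fit, parm = "disp", level = 0.99)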
6.6 Predictions
To predict \(\hat Y\) for a new value of \(X\), we use the predict()
method. This method takes a data frame of one or more rows of new observations. This data frame must have column names that match the predictors in the original model formula.
new_cars <- data.frame(disp = c(2, 400), hp = c(400, 2))
predict(fit, newdata = new_cars)
1 2
20.73918 18.54771
A common mistake is to provide a newdata
whose columns do not exactly match the original fit. This produces an error. For example, here we’ve misspelled the disp
column:
predict(fit, newdata = data.frame(dips = c(2, 400),
                                  hp = c(400, 2)))
Error in eval(predvars, data, env): object 'disp' not found
Another common problem is when the original formula given to lm()
refers to a specific data frame by name. For example, if we fit the model using lm(mtcars$mpg ~ mtcars$disp + mtcars$hp)
, predict()
will look for columns named mtcars$mpg
, mtcars$disp
, and mtcars$hp
inside the newdata
data frame. Using the data =
argument is always preferable, as it avoids problems like this.
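As a sketch of that pitfall (fit_bad is a name invented here; the exact behavior may vary, but the predictions do not use new_cars as intended):

# Anti-pattern: the formula hard-codes the mtcars data frame
fit_bad <- lm(mtcars$mpg ~ mtcars$disp + mtcars$hp)

# predict() now looks for mtcars$disp and mtcars$hp rather than the
# disp and hp columns of new_cars, so the new data is effectively ignored
predict(fit_bad, newdata = new_cars)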
Optionally, we can ask predict()
for confidence intervals or prediction intervals, using the pointwise estimators described in Section 5.4.2 and Section 5.4.3:
predict(fit, newdata = new_cars, interval = "confidence")
fit lwr upr
1 20.73918 10.77086 30.70749
2 18.54771 12.25457 24.84085
predict(fit, newdata = new_cars, interval = "prediction")
fit lwr upr
1 20.73918 8.896103 32.58226
2 18.54771 9.575828 27.51960