6 Linear Models for Experiments
\[ \DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator{\R}{\mathbb{R}} \DeclareMathOperator{\RSS}{RSS} \DeclareMathOperator{\var}{Var} \DeclareMathOperator{\cov}{Cov} \newcommand{\ind}{\mathbb{I}} \newcommand{\T}{^\mathsf{T}} \newcommand{\tsp}{\bar \tau_\text{sp}} \newcommand{\tsphat}{\hat \tau_\text{sp}} \DeclareMathOperator*{\argmin}{arg\,min} \]
In Chapter 4, we began with a conceptually very simple model: We have \(n\) fixed experimental units, their potential outcomes are well-defined, and their responses are random only because we performed random assignment. This gives us complete control over the randomness and permits exact randomization tests.
However, this approach will prove limiting, particularly when we want to experiment on multiple treatment factors simultaneously, while blocking and controlling for covariates. The estimators become harder to define and work with. Can we use an approach that more easily handles multiple factors and covariates? Sure, if we use a model. Let’s consider our favorite models: linear models.
6.1 Super-population design
In the finite-population design (Section 4.1), our \(n\) experimental units formed the entire population. This allowed us to define the population average treatment effect as a sum over all \(n\), and ensured the counterfactuals were fixed, not random.
But we could also say that there is some hypothetical population distribution \(P\), from which our \(n\) experimental units are randomly sampled. Now the counterfactuals \(C_i(0)\) and \(C_i(1)\) are random, because unit \(i\) was randomly sampled from the population. We call this the “super-population” from which our population of \(n\) units comes. (Don’t ask why we don’t just call it a “sample.”)
The population average treatment effect is now an expectation, not a sum.
Definition 6.1 (Average treatment effect) In the super-population model, the average treatment effect \(\tsp\) is \[\begin{align*} \tsp &= \E[C_i(1) - C_i(0)] \\ &= \E[C_i(1)] - \E[C_i(0)], \end{align*}\] where the expectations average over the superpopulation.
To estimate the ATE, we will evidently have to estimate \(\E[C_i(1)]\) and \(\E[C_i(0)]\). Let’s expand those expectations: \[\begin{align*} \E[C_i(1)] &= \E[ \E[C_i(1) \mid X_i] ] & \text{law of iterated expectation}\\ &= \E[ \E[C_i(1) \mid X_i, Z_i = 1] ] & \text{when $Z_i$ is independent of $C_i$}\\ &= \E[ \E[Y_i \mid X_i, Z_i = 1] ] & \text{because $Y_i = C_i(Z_i)$}. \end{align*}\] So if we can model the conditional mean of \(Y\) given \(X\) and the treatment assignment, then average over all \(X\), we can estimate the ATE. It’s just regression! We fit models for \(\E[Y_i \mid X_i, Z_i]\) and then use them to get the ATE.
Let’s apply this to a simple randomized experiment.
Example 6.1 (Completely randomized experiment, superpopulation-style) Consider a completely randomized experiment with two treatment levels, \(Z_i \in \{0, 1\}\), and no covariates. We choose to model \[\begin{align*} \E[Y_i \mid Z_i] &= \begin{cases} \beta_0 & \text{if $Z_i = 0$} \\ \beta_0 + \beta_1 & \text{if $Z_i = 1$}. \end{cases}\\ &= \beta_0 + \beta_1 Z_i \end{align*}\] Consequently, the average treatment effect \(\tsp\) is \[ \tsp = \E[\E[Y_i \mid Z_i = 1]] - \E[\E[Y_i \mid Z_i = 0]] = (\beta_0 + \beta_1) - \beta_0 = \beta_1, \] where the outer expectations average over the covariates \(X\), but here do nothing because there are no covariates.
Traditionally, simple linear models like these are fit using ordinary least squares (OLS). The parameters \(\beta_0\) and \(\beta_1\) are obtained by minimization: \[ \hat \beta_0, \hat \beta_1 = \argmin_{\beta_0, \beta_1} \sum_{i=1}^n (Y_i - (\beta_0 + \beta_1 Z_i))^2. \] See any regression textbook for details on least squares and the derivation of analytical forms for the estimates; see my 707 notes for examples of running regression in R. We then plug in the estimates \(\hat \beta_0\) and \(\hat \beta_1\) to get our treatment effect estimates.
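As a concrete illustration, here is a minimal R sketch with simulated data (the sample size, effect size, and variable names are all made up for this example) showing the fit and the resulting treatment effect estimate:

# Simulate a small completely randomized experiment
set.seed(17)
n <- 100
z <- sample(rep(c(0, 1), each = n / 2))  # random assignment, half treated
y <- 5 + 2 * z + rnorm(n)                # outcomes with a true effect of 2

fit <- lm(y ~ z)
coef(fit)["z"]                           # OLS estimate of the treatment effect

# Identical to the mean-difference estimate:
mean(y[z == 1]) - mean(y[z == 0])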
Exercise 6.1 (The OLS estimator) The OLS estimators are so widely used that I picked a textbook off my shelf, opened it to a random page, and found the analytical solutions: \[\begin{align*} \hat \beta_1 &= \frac{\sum_{i = 1}^n (Z_i - \bar Z) (Y_i - \bar Y)}{\sum_{i=1}^n (Z_i - \bar Z)^2} & \hat \beta_0 &= \bar Y - \hat \beta_1 \bar Z, \end{align*}\] where \(\bar Y\) and \(\bar Z\) are the sample means of the two quantities.
Show that the OLS estimate of the treatment effect \(\hat \tau_\text{OLS} = \hat \beta_1\) is identical to the mean difference estimate, Definition 4.2.
In other words, in a randomized experiment, \(\hat \beta_1\) really does mean that increasing \(Z_i\) from 0 to 1 causes a change in the mean of \(Y\).
With covariates, the model must become a little more complex.
Example 6.2 (Superpopulation CRD with covariates) Let’s assume we have the treatment assignment \(Z_i\) plus a vector \(X_i\) of \(p\) covariates. We model \[ \E[C_i(j)] = \E[\E[Y_i \mid X_i, Z_i = j]], \] where the outer expectation is over the distribution of \(X\).
With two treatments, then, we need two separate models: \[\begin{align*} \E[Y_i \mid X_i, Z_i = 0] &= \beta_0 + X_i\T \beta\\ \E[Y_i \mid X_i, Z_i = 1] &= \beta_0^* + X_i\T \beta^*. \end{align*}\] Here the two models have different slopes and intercepts. But this is equivalent to having an interaction between \(Z_i\) and each covariate in \(X_i\): \[ \E[Y_i \mid X_i, Z_i] = \beta_0 + \beta_1 Z_i + X_i\T (\beta + \beta' Z_i). \] Compared to the previous form, \(\beta_0^* = \beta_0 + \beta_1\) and \(\beta^* = \beta + \beta'\).
Hence we can build a linear model for a randomized experiment with covariates by including a treatment indicator, covariates, and the appropriate interaction. To estimate the average treatment effect, we must use our model estimates to average over \(X\).
Example 6.3 (Binary treatment, binary covariate) Consider conducting an experiment with binary treatment \(Z_i \in \{0, 1\}\) and a binary covariate \(X_i \in \{0, 1\}\). For instance, \(X_i\) could be a discrete variable we block on. We build a linear model with three covariates: \[ \E[Y_i \mid X_i, Z_i] = \beta_0 + \beta_1 Z_i + \beta_2 X_i + \beta_3 X_i Z_i. \] Consider a grid where each column is a treatment, each row is a covariate value, and the entries are the conditional means:
|  | \(Z_i = 0\) | \(Z_i = 1\) |
|---|---|---|
| \(X_i = 0\) | \(\beta_0\) | \(\beta_0 + \beta_1\) |
| \(X_i = 1\) | \(\beta_0 + \beta_2\) | \(\beta_0 + \beta_1 + \beta_2 + \beta_3\) |
How do we calculate the average treatment effect from Definition 6.1? Following Example 6.1, we find that \[\begin{align*} \tsp ={}& \E[\E[Y_i \mid X_i, Z_i = 1] - \E[Y_i \mid X_i, Z_i = 0]]\\ ={}& (\E[Y_i \mid X_i = 1, Z_i = 1] - \E[Y_i \mid X_i = 1, Z_i = 0]) \Pr(X_i = 1)\\ &{}+ (\E[Y_i \mid X_i = 0, Z_i = 1] - \E[Y_i \mid X_i = 0, Z_i = 0]) \Pr(X_i = 0), \end{align*}\] by the law of total expectation. In other words, the treatment effects conditional on \(X_i\) are the differences between the entries in each row of the table. We can substitute in the values from our table to find that \[ \tsp = (\beta_1 + \beta_3) \Pr(X_i = 1) + \beta_1 \Pr(X_i = 0). \] We are, in effect, averaging the treatment effect from each row, weighted by how common those covariate values are in the population.
In practice, we would estimate \(\hat \beta_0\) through \(\hat \beta_3\) using OLS, and estimate \(\Pr(X_i)\) using the proportion in each group in our sample.
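In R, that two-step procedure (fit the interaction model, then average the conditional effects over the sample distribution of \(X\)) might look like the following sketch; the data frame dat and its columns y, z, and x are hypothetical.

# Fit the model with a treatment-by-covariate interaction
fit <- lm(y ~ z * x, data = dat)
b <- coef(fit)                   # (Intercept), z, x, z:x

# Conditional treatment effects at each covariate value, averaged over Pr(X = x)
effect_x0 <- b["z"]              # effect when x = 0
effect_x1 <- b["z"] + b["z:x"]   # effect when x = 1
p1 <- mean(dat$x == 1)
ate_hat <- effect_x1 * p1 + effect_x0 * (1 - p1)
ate_hat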
Exercise 6.2 (OLS estimate in blocked designs) Suppose that the covariate in Example 6.3 is a blocking variable. We can estimate the ATE using either the OLS approach shown or the blocked mean difference estimator, Definition 5.3. Show these are equivalent.
6.2 Variances in linear models
So far it appears that using ordinary least squares and the super-population approach yields identical point estimates to using treatment effect estimators in the finite population model. Before we decide to use least squares for everything, we should consider whether it also leads to the same inferences.
You’re probably familiar with the classic results for variance in OLS models. If we model \(\E[Y \mid X] = X \beta\) (where we’ve now added the intercept into \(X\)), the standard result is that \[ \var(\hat \beta \mid X) = \sigma^2 (X\T X)^{-1}, \] where \(\sigma^2\) is the variance of the error: \(Y = X \beta + \epsilon\), and \(\sigma^2 = \var(\epsilon)\). This result assumes the errors are homoskedastic, meaning \(\var(\epsilon_i) = \sigma^2\) is the same for every observation, regardless of \(X\); the more general result is that \[ \var(\hat \beta \mid X) = (X\T X)^{-1} X\T \var(Y \mid X) X (X\T X)^{-1}. \] This leads us to the sandwich estimator for the variance.
Definition 6.2 (Sandwich estimator) The sandwich estimator, also known as the Huber–White estimator, for the variance of \(\hat \beta\) in an ordinary least squares regression is \[ \widehat{\var}(\hat \beta \mid X) = (X\T X)^{-1} X\T \widehat{\var}(Y \mid X) X (X\T X)^{-1}, \] where we estimate \(\widehat{\var}(Y \mid X)\) as a diagonal matrix with potentially unequal entries: \[ \widehat{\var}(Y \mid X) = \begin{pmatrix} \hat \epsilon_1^2 & 0 & \cdots & 0 \\ 0 & \hat \epsilon_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \hat \epsilon_n^2 \end{pmatrix}. \] Here \(\hat \epsilon_i = Y_i - \hat \beta\T X_i\), and these are variance estimates because \(\var(Y \mid X) = \var(X\beta + \epsilon \mid X) = \var(\epsilon \mid X)\), and \(\var(\epsilon) = \E[\epsilon^2] - \E[\epsilon]^2\), where the second term is zero because the errors have mean zero.
This estimator is valid even when the error variances are unequal, unlike the usual estimator.
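To make the formula concrete, here is a sketch of the computation from an lm() fit (the object fit is hypothetical); if the sandwich package is available, vcovHC() with type = "HC0" computes the same matrix.

X <- model.matrix(fit)            # design matrix, including the intercept
e <- residuals(fit)               # the epsilon-hats
bread <- solve(t(X) %*% X)
meat <- t(X) %*% diag(e^2) %*% X  # X' diag(epsilon-hat^2) X
V_hat <- bread %*% meat %*% bread # the sandwich
# Compare: sandwich::vcovHC(fit, type = "HC0")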
We can show that when we use the sandwich estimator, the variance we obtain matches the conservative variance estimate we developed for randomized experiments in the finite population setting.
Exercise 6.3 (Variance in a binary experiment) Consider an experiment with two treatments, \(Z_i \in \{0, 1\}\), and no covariates. We use the OLS model shown in Example 6.1 and the estimators in Exercise 6.1. Show that \[ \widehat{\var}(\hat \beta_1 \mid X) = \frac{\sum_{i = 1}^n \hat \epsilon_i^2 (Z_i - \bar Z)^2}{\left( \sum_{i=1}^n (Z_i - \bar Z)^2 \right)^2} = \frac{s_1^2}{n_1} + \frac{s_0^2}{n_0}, \] where \(s_0^2\) and \(s_1^2\) are defined as in Theorem 4.2.
Because \(\tsphat = \hat \beta_1\), this says that \(\widehat{\var}(\tsphat \mid X)\) matches the variance we previously obtained in the finite population case.
In short, we’ve found several important results here:
- In simple binary experiments, OLS gives exactly the same estimators as we already used in completely randomized experiments.
- The OLS variance estimates computed with the sandwich estimator, derived from the assumption that \(Y\) is a random variable drawn from a population, match the conservative variance estimates we derived for finite-population completely randomized experiments.
- When we include covariates, such as blocking variables, the OLS estimates also match the finite-population estimates.
So regardless of whether we believe in finite populations or superpopulations, we can justify using ordinary least squares and linear models.
6.3 The general linear model framework
The finite-population and superpopulation ideas grew up around the same time in the 1930s, but by the 1950s, the superpopulation concept with linear models was the dominant form of experimental design. I’m no historian of statistics, but one obvious reason is the great flexibility of linear models to accommodate all kinds of wacky designs, and the ease of deriving variances and tests using asymptotic results and creative linear algebra.
Linear models are traditionally written a little differently in experimental design than you may be used to. Consider an experiment with \(v\) treatments (which may be more than two). We might choose to model \[ Y_{it} = \mu + \tau_i + \epsilon_{it}, \] where \(i \in \{1, \dots, v\}\) is the treatment and \(t \in \{1, \dots, r_i\}\) is the experimental unit, and \(r_i\) is the number of units—the number of replications—in treatment \(i\). The errors \(\epsilon_{it}\) are assumed to be mean 0 and uncorrelated. Though above we allowed the errors to have unequal variance, it is common to assume that \(\var(\epsilon_{it}) = \sigma^2\), so all have equal variance. Together, the model implies that \[ \E[Y_{it}] = \mu + \tau_i. \]
You may already notice a problem with this model: it is not identifiable.
Definition 6.3 (Identifiability) Suppose a model is written as a function of some parameters: \(\E[Y] = f(\beta)\), where \(\beta\) is a vector of one or more parameters and \(f\) is an arbitrary function. The parameters are identifiable if for any \(\beta_1\) and \(\beta_2\), \(f(\beta_1) = f(\beta_2)\) implies \(\beta_1 = \beta_2\).
Or, in words: If the parameters are identifiable, then given a specific model, there is only one unique set of parameters that produces that model.
Exercise 6.4 (Non-identifiable model) The model \[ \E[Y_{it}] = \mu + \tau_i \] is not identifiable. Demonstrate this by giving an example of two sets of parameters that yield an identical model.
On the other hand, despite the model not being identifiable, the quantities we care about are estimable.
Definition 6.4 (Estimability) A function of the parameters of a linear model is estimable if it can be written as the expectation of a linear combination of the response variables.
For example, in the model above, \(\mu + \tau_i\) is estimable for any \(i\): simply take the expectation of any particular response with that \(i\). Indeed, all the quantities we need are estimable in the above model, even though it is not identifiable.
Exercise 6.5 (Estimability of the ATE) Consider a binary experiment. We model \[ Y_{it} = \mu + \tau_i + \epsilon_{it}, \] where \(i \in \{0, 1\}\) and \(t \in \{1, \dots, r_i\}\). What is the average treatment effect? Write it as an expectation, and then write it in terms of \(\mu\) and \(\tau\).
Nonetheless, software that computes least squares estimates usually forces the model to be identifiable by imposing additional constraints. There are several common options:
- Require that \(\hat \mu = 0\). This makes the model identifiable and gives \(\tau_i\) easy interpretation.
- Force \(\hat \tau_1 = 0\) (or some other specific \(\tau\)). This matches R’s behavior: if treatment were coded as a binary factor, R fits an intercept (\(\mu\)) and a coefficient for one treatment level, but not the other.
- Require \(\sum_{i} r_i \hat \tau_i = 0\). TODO who does this?
With these constraints enforced, there is a unique least squares solution to estimate the parameters of the model. We can also think of these constraints as different ways to translate the model into the form \(\E[Y \mid X] = X \beta\), by constructing the right matrix \(X\) so components of \(\beta\) correspond to our parameters of interest.
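For example, here is a quick sketch of how R applies the second option by default: with a three-level treatment factor (made up here for illustration), model.matrix() produces an intercept column plus indicator columns for every level except the first, which amounts to fixing \(\hat \tau_1 = 0\).

trt <- factor(rep(c("a", "b", "c"), each = 2))
model.matrix(~ trt)
# Columns: (Intercept), trtb, trtc -- there is no trta column,
# so the coefficient for level "a" is implicitly fixed at zero.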
The model can easily be extended to handle blocking and even multiple treatment factors. For example, we could model \[ Y_{ijt} = \mu + \tau_{ij} + \epsilon_{ijt}, \] where \(i \in \{1, \dots, v\}\) is one treatment level, \(j \in \{1, \dots, u\}\) is the treatment level for the other treatment factor, and \(t \in \{1, \dots, r_{ij}\}\) indexes the experimental units in that treatment combination. We’ll consider more complex multi-factor experiments in Chapter 7.
In any of these models, we will need a way to estimate \(\var(\epsilon) = \sigma^2\). That poses a problem. Suppose \(r_{ij} = 1\) for all \(i\) and \(j\). There is only one replicate for each block and treatment—only one experimental unit in each combination. In that case, the model is saturated: there are as many estimable cell means as observations, so the fitted values match the observations exactly, every residual is zero, and there are no degrees of freedom left over for estimating the variance. We must have more than one replicate in at least one treatment combination to estimate the variance.
To estimate the variance, we can adapt the usual result from linear regression. An unbiased estimator is \[ \hat \sigma^2 = \frac{\RSS}{n - p} = \frac{\sum_{i=1}^v \sum_{j=1}^u \sum_{t=1}^{r_{ij}} \left(Y_{ijt} - (\hat \mu + \hat \tau_{ij})\right)^2}{n - vu}. \] This is the usual residual sum of squares, rewritten for our form of model, with \(p\) replaced by the number of uniquely estimable parameters in the model (here the \(vu\) cell means \(\mu + \tau_{ij}\)).
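In R, this estimate falls out of any least squares fit; a minimal sketch, assuming fit is the lm() or aov() object for the model above:

rss <- sum(residuals(fit)^2)
sigma2_hat <- rss / df.residual(fit)  # df.residual() is n minus the number of estimated parameters
sigma(fit)^2                          # the same quantity, as computed by R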
6.4 Contrasts
In linear models, the coefficients are not always the answer to your research question. By this I mean that the individual parameters \(\mu\) or \(\tau_i\) do not themselves answer your research question, which instead is often answered by a linear combination of the parameters. For example, in Exercise 6.5, we found that the average treatment effect is a linear combination of the parameters.
We will find we often need linear combinations of parameters, and so they are given a name: contrasts.
6.4.1 Definition and estimation
Definition 6.5 (Contrasts) In a linear model defined as in Section 6.3, a contrast is a linear combination \[ \sum_{i=1}^v c_i \tau_i, \qquad \text{where } \sum_{i=1}^v c_i = 0. \] The coefficient vector \(c = (c_1, \dots, c_v)\T\) is called the contrast vector.
All contrasts are estimable using ordinary least squares. As you might expect, the least squares estimator is \[ \sum_{i=1}^v c_i \left( \frac{1}{r_i} \sum_{t=1}^{r_i} Y_{it} \right) = \sum_{i=1}^v c_i \hat \tau_i. \]
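As a quick sketch (with a hypothetical response vector y and treatment factor trt with four levels), estimating a contrast is just a weighted sum of the treatment means:

treatment_means <- tapply(y, trt, mean)  # Y-bar for each treatment
cvec <- c(1, -1, 0, 0)                   # example contrast: treatment 1 minus treatment 2
sum(cvec * treatment_means)              # the estimated contrast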
With the language of linear models and contrasts, it is easy to move beyond binary treatments to experiments with arbitrarily many treatment arms.
Exercise 6.6 (Treatment contrasts) Consider conducting an experiment where there are 4 treatment arms (so \(i \in \{1, 2, 3, 4\}\)). Write contrast vectors that estimate:
- The average treatment effect for the difference between treatment 3 and treatment 1
- The difference between the average response for treatment 1 and the average response for treatments 2, 3, and 4, weighted equally
- The difference between the average response of treatments 1 and 2 and the average response of treatments 3 and 4.
6.4.2 Inference
Contrasts can be used for inference.
Theorem 6.1 (Variance of contrasts) Because they are simple linear combinations, we can calculate the variance of an estimated contrast as \[\begin{align*} \var\left( \sum_{i=1}^v c_i \hat \tau_i \right) &= \sum_{i=1}^v c_i^2 \var(\hat \tau_i)\\ &= \sum_{i=1}^v \frac{c_i^2}{r_i} \var(Y) \\ &= \sigma^2 \sum_{i=1}^v \frac{c_i^2}{r_i}. \end{align*}\]
As linear combinations of sample means, standard linear model theory and sampling theory justify treating these estimates as approximately normal; when the errors are exactly normal, the estimates are exactly normal. We can consequently estimate this variance and use it to conduct \(t\) tests for contrasts.
Theorem 6.2 (t test for a contrast) Let \(T = \sum_i c_i \tau_i\) be a contrast in an experiment with \(v\) treatments and \(r_i\) replicates per treatment. Let \(\hat T\) be its estimate and let \(n = \sum_i r_i\). Then, under the null hypothesis that \(T = t_0\), \[ \frac{\hat T - t_0}{\sqrt{\widehat{\var}(\hat T)}} \sim t_{n - v}, \] where the variance is estimated by plugging \(\hat \sigma^2\) into Theorem 6.1.
This result permits \(t\) tests for particular null hypotheses, such as the common null that \(T = 0\). We can also obtain confidence intervals from it: because the theorem implies \[ \Pr\left(t_{n - v, \alpha/2} \leq \frac{\hat T - T}{\sqrt{\widehat{\var}(\hat T)}} \leq t_{n - v, 1 - \alpha/2}\right) = 1 - \alpha, \] we can rearrange (using the symmetry \(t_{n-v,\alpha/2} = -t_{n-v,1-\alpha/2}\)) to obtain a \(1 - \alpha\) confidence interval for \(T\): \[ \left[ \hat T + t_{n - v, \alpha/2} \sqrt{\widehat{\var}(\hat T)}, \hat T + t_{n - v, 1 - \alpha/2} \sqrt{\widehat{\var}(\hat T)} \right]. \]
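Combining Theorem 6.1 and Theorem 6.2, the test and interval take only a few lines; this sketch assumes you already have the treatment means, the vector of replicate counts r, the contrast vector cvec, sigma2_hat, and the totals n and v from your experiment.

T_hat <- sum(cvec * treatment_means)
se_T <- sqrt(sigma2_hat * sum(cvec^2 / r))     # plug-in version of Theorem 6.1
df_error <- n - v
t_stat <- (T_hat - 0) / se_T                   # test of the null T = 0
p_value <- 2 * pt(abs(t_stat), df_error, lower.tail = FALSE)
ci_95 <- T_hat + qt(c(0.025, 0.975), df_error) * se_T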
We may be interested in multiple contrasts, in which case their covariance will be of interest as well.
Theorem 6.3 (Covariance of contrasts) Consider two contrast vectors \(c\) and \(c'\). The covariance of the estimated contrasts is \[ \cov\left( \sum_{i=1}^v c_i \hat \tau_i, \sum_{i=1}^v c_i' \hat \tau_i \right) = \sigma^2 \sum_{i=1}^v \frac{c_i c_i'}{r_i}. \] Consequently, when the treatments have equal numbers of replicates and \(c\) and \(c'\) are orthogonal—when \(\sum_i c_i c_i' = 0\)—the estimates are uncorrelated.
Proof. Recall a basic property of covariances: \[\begin{multline*} \cov(aX + bY, cW + dZ) = ac \cov(X, W) + ad \cov(X, Z) \\ + bc \cov(Y, W) + bd \cov(Y, Z). \end{multline*}\]
In the contrast case, all the cross-terms are zero, as \(\cov(\hat \tau_i, \hat \tau_j) = 0\) for \(i \neq j\). (Why?)
We’re left with \[\begin{align*} \cov\left( \sum_{i=1}^v c_i \hat \tau_i, \sum_{i=1}^v c_i' \hat \tau_i \right) &= \sum_{i=1}^v c_i c_i' \cov(\hat \tau_i, \hat \tau_i)\\ &= \sum_{i=1}^v c_i c_i' \var(\hat \tau_i)\\ &= \sum_{i=1}^v c_i c_i' \frac{\var(Y)}{r_i}\\ &= \sigma^2 \sum_{i=1}^v \frac{c_i c_i'}{r_i}. \end{align*}\]
Combined with a geometric interpretation of contrasts, this gives us an important result we will rely upon soon.
Theorem 6.4 (Basis for contrasts) By Definition 6.5, a contrast vector in an experiment with \(v\) treatments is a \(v\)-dimensional vector. Because of the constraint that \(\sum_i c_i = 0\), the contrast vectors all lie in (and together span) the \((v - 1)\)-dimensional subspace of \(\R^v\) orthogonal to the all-ones vector.
We can hence form an orthogonal basis of \(v - 1\) contrast vectors, and any other contrast can be written as a linear combination of the contrasts in that basis. This motivates Scheffé’s method for constructing simultaneous \(1 - \alpha\) confidence intervals for all possible contrasts.
We will return to contrasts in Chapter 7, when we see their many useful properties for estimation in multifactor experiments, and how those properties can guide our design.
6.5 Blocking
We can now consider blocked designs in linear models. We already know that blocking improves our experiments by increasing the power of tests and reducing the variance of estimates, but now we can see this through classic linear model theory, and the theory can give us much more specific guidance on how to design a blocked experiment.
6.5.1 The model
We’ve already seen a linear model for a blocked design in Example 6.3, but this model included an interaction. In classic experimental design, it’s common to start with a model that assumes purely additive effects: \[ Y_{hit} = \mu + \underbrace{\theta_h}_\text{block} + \underbrace{\tau_i}_\text{treatment} + \underbrace{\epsilon_{hit}}_\text{noise}, \] where \(h = 1, \dots, B\) indicates the block, \(i = 1, \dots, v\) indexes the treatments, and \(t\) indexes the experimental units within each block and treatment. Typically we assume \(\epsilon\) is independent with variance \(\sigma^2\), and often we assume it’s normally distributed.
This looks just like a two-factor ANOVA, and indeed it is. It can be trivially rewritten in the conventional \(Y = X\beta + \epsilon\) linear model form. But remember that blocking is not just a way to analyze data—it is part of the design of the experiment, and affects how we assign treatments.
Exercise 6.7 (Blocked model to ordinary linear model) Consider the model \[ Y_{hit} = \mu + \theta_h + \tau_i + \epsilon_{hit}, \] where there are 3 blocks and 4 treatments. We can write an equivalent model in the form \(Y = X\beta + \epsilon\), where \(\beta\) is a vector of coefficients and \(X\) is a matrix of covariates.
How many columns will \(X\) have, and what do those columns represent?
Though we’re including a blocking variable, the effect we are most interested in estimating is still the treatment effect; learning something about the blocking variable is just a bonus. Our contrasts, then, are still linear combinations of treatment means \(\tau_i\).
Definition 6.6 (Contrasts in blocked designs) In a blocked model with additive effects as defined above, a contrast is a linear combination \[ \sum_{i=1}^v c_i \tau_i, \] where the contrast coefficients sum to 0. All contrasts remain estimable with ordinary least squares, and the estimator is \[ \sum_{i=1}^v c_i \hat \tau_i = \sum_{i=1}^v c_i \bar Y_{\cdot i \cdot}, \] where \(\bar Y_{\cdot i \cdot}\) is the sample average over blocks and replicates.
6.5.2 Variance of blocked designs
If you’ve ever learned ANOVA the traditional way, you may have noticed how much it focuses on the partitioning of variance: breaking down the total variance in \(Y\) (the total sum of squares) into component parts, in the form of sums of squares attributed to various predictors. You may have wondered why we bother with that, because the sums of squares are not particularly useful in regression. But they do help us understand the benefit of certain experimental designs, which is perhaps why they have been taught for so long.
Recall that the prime quantity in ordinary least squares regression is the residual sum of squares, \[ \RSS = \sum_{i=1}^n (Y_i - \hat Y_i)^2, \] where \(\hat Y_i\) is the model’s prediction for observation \(i\). In this case, \[ \RSS = \sum_{h=1}^B \sum_{i=1}^v \sum_{t=1}^r (Y_{hit} - (\hat \mu + \hat \theta_h + \hat \tau_i))^2, \] where \(r\) is the number of replicates in each block/treatment combination. Notice we are requiring \(r\) to be the same in each block and treatment combination, making this a balanced design. That is crucial for the upcoming calculation.
The residual sum of squares is one component of the total sum of squares, \[ \text{SST} = \sum_{h=1}^B \sum_{i=1}^v \sum_{t=1}^r (Y_{hit} - \bar Y_{\cdot \cdot \cdot})^2, \] where a bar with a \(\cdot\) in a subscript indicates averaging over that subscript. The total sum of squares, then, is directly related to the variance of \(Y\).
Theorem 6.5 (Decomposition of the squared error) In balanced designs, \[ \text{SST} = Br \sum_{i=1}^v (\bar Y_{\cdot i \cdot} - \bar Y_{\cdot \cdot \cdot})^2 + vr \sum_{h=1}^B (\bar Y_{h \cdot \cdot} - \bar Y_{\cdot \cdot \cdot})^2 + \RSS. \] This is a little hard to think about, so consider the case when \(r = 1\): \[ \text{SST} = B \underbrace{\sum_{i=1}^v (\bar Y_{\cdot i} - \bar Y_{\cdot \cdot})^2}_\text{treatment variation} + v \underbrace{\sum_{h=1}^B (\bar Y_{h \cdot} - \bar Y_{\cdot \cdot})^2}_\text{block variation} + \RSS. \]
In other words, in a balanced block design, the total variation in \(Y\) breaks down into three parts:
- Variation due to the treatment
- Variation due to the block
- Variation due to the error.
Exercise 6.8 (Prove it) Show the above identity by expanding out the definition of the total sum of squares, rearranging, and showing that the cross terms are 0.
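A quick numerical check of the identity (no substitute for the algebra, but reassuring) on simulated balanced data with \(r = 1\); the data here is made up purely for illustration:

set.seed(42)
B <- 4; v <- 3
d <- expand.grid(block = factor(1:B), trt = factor(1:v))
d$y <- rnorm(nrow(d), mean = 10 + as.numeric(d$block) + 2 * as.numeric(d$trt))

fit <- lm(y ~ block + trt, data = d)
sst <- sum((d$y - mean(d$y))^2)
ss_trt <- B * sum((tapply(d$y, d$trt, mean) - mean(d$y))^2)
ss_block <- v * sum((tapply(d$y, d$block, mean) - mean(d$y))^2)
rss <- sum(residuals(fit)^2)
all.equal(sst, ss_trt + ss_block + rss)  # TRUE: the three pieces add up exactly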
Why do we care about sum-of-squares arithmetic? Well, once we run an experiment, the total sum of squares is fixed. If we don’t block on anything, all of that variation must be accounted for by the treatment and by the residual error—so if we block on something, the residual error must go down.
Another way to think of it: if we fit a model without the blocking variable, but the true relationship is blocked, then in the new model, \[ Y_{hit} = \mu + \tau_i + \underbrace{\epsilon'_{hit}}_{\theta_h + \epsilon_{hit}}. \] The error must account for the variation caused by the block variable and by any measurement error, noise, sampling variation, and so on.
As you know, the typical unbiased estimator for the error variance in a linear model is \[ \hat \sigma^2 = \frac{\RSS}{n - p}, \] where \(p\) is the number of (identifiable) parameters in the model. Blocking hence reduces \(\hat \sigma^2\). This affects our contrasts, the primary quantities of interest in an experiment.
Theorem 6.6 (Variance of contrasts in blocked designs) By Definition 6.6, the variance of any contrast estimate is \[\begin{align*} \var\left( \sum_{i=1}^v c_i \hat \tau_i \right) &= \sum_{i=1}^v c_i^2 \var(\bar Y_{\cdot i \cdot})\\ &= \sum_{i=1}^v c_i^2 \frac{\var(Y_i)}{Br}\\ &= \sigma^2 \sum_{i=1}^v \frac{c_i^2}{Br}. \end{align*}\]
Compare this to Theorem 6.1: A design without blocks has \(r_i\) units in treatment \(i\), whereas the blocked design has \(r\) in each of \(B\) blocks, for \(rB\) units in treatment \(i\). But because the blocked design will have a smaller \(\sigma^2\), it will most likely have a smaller \(\hat \sigma^2\), and hence will produce estimates of greater precision.
6.5.3 Optimal blocked designs
These results begin to give us guidance for designing an experiment with blocking variables.
If our goal is to estimate a treatment effect using a contrast, the variance of that estimate depends on \(\sigma^2\). According to the variance decomposition, the best way to reduce \(\sigma^2\) is to choose a blocking variable that maximizes \[ \sum_{h=1}^B (\bar Y_{h \cdot \cdot} - \bar Y_{\cdot \cdot \cdot})^2, \] the variation in mean response between blocks.
TODO anything else
6.6 Inference through ANOVA
In principle, if you’re familiar with the theory of inference in linear models, you know everything you need to know. Any model for a designed experiment can be written in the familiar \(Y = X \beta + \epsilon\) form of linear regression, and any contrast of \(\hat \tau_i\)s can be written as a combination of \(\hat \beta\)s. And we know how to do inference for \(\hat \beta\)—for example, see Section 5.2 of my regression notes.
But there is a great deal of specialized machinery for inference in experimental design, not because it is mathematically necessary but because it was computationally necessary. The machinery applies to special designs and not to arbitrary experiments, but the special cases often correspond to particularly desirable designs.
The computational necessity arose from the difficulty of calculating \(\hat \beta = (X\T X)^{-1} X\T Y\) in the days before computers. Once we have more than a few treatment and blocking factors, this requires doing large matrix multiplications, then inverting a large matrix, all by hand. Any further operation, such as producing confidence intervals for particular functions of \(\hat \beta\), requires even more matrix and vector operations.
By contrast, we will see that the special cases don’t require any matrix operations, and often make inference as easy as inference in a univariate regression.
6.6.1 The variance decomposition
As we saw in Theorem 6.5, the total sum of squares in a balanced experiment can be decomposed into components for the treatment and blocking factors. The theorem generalizes to cases with multiple treatment and blocking factors, provided there is full balance.
This suggests we can organize the crucial information in a table. Because we’re working with variances, this is Analysis of Variance, and the table is called an ANOVA table.
In an experiment with one treatment factor with \(v\) levels, one blocking factor with \(B\) levels, and \(r\) replicates per block/treatment combination, the table might look like this:
| Variation | DF | Sum of squares | Mean square | Ratio |
|---|---|---|---|---|
| Block | \(B - 1\) | \(\text{SSB} = v r \sum_h \bar Y_{h\cdot \cdot}^2 - Bvr \bar Y_{\cdot \cdot \cdot}^2\) | \(\text{MSB} = \text{SSB}/(B - 1)\) | \(\text{MSB} / \text{MSE}\) |
| Treatment | \(v - 1\) | \(\text{SSR} = Br \sum_i \bar Y_{\cdot i \cdot}^2 - Bvr \bar Y_{\cdot \cdot \cdot}^2\) | \(\text{MSR} = \text{SSR} / (v - 1)\) | \(\text{MSR} / \text{MSE}\) |
| Error | \(Bvr - B - v + 1\) | \(\RSS\) | \(\text{MSE} = \RSS/(Bvr - B - v + 1)\) | |
| Total | \(Bvr - 1\) | \(\text{SST}\) | | |
The “sum of squares” column makes sense following Theorem 6.5. But what about the others?
The DF column gives the degrees of freedom. There are many ways to think of this, but a good one is in terms of geometry. In this experiment, there are \(Bvr\) observations of \(Y\). We can hence think of \(Y\) as a vector in \(Bvr\)-dimensional space. Now consider writing the model in the form \[ Y = X \beta + \epsilon, \] where \(X\) is a \(Bvr \times p\) design matrix and both \(Y\) and \(\epsilon\) are \(Bvr\)-vectors. \(\beta\) is a \(p\)-vector, where \(p\) is the number of estimable parameters. We can break this matrix multiplication into pieces for the model components: \[ Y = \mathbf{1} \mu + X_\text{block} \theta + X_\text{treat} \tau + \epsilon, \] where \(X_\text{block}\) is a \(Bvr \times (B - 1)\) matrix of block indicators and \(X_\text{treat}\) is a \(Bvr \times (v - 1)\) matrix of treatment indicators.
So we are expressing \(Y\) in terms of a basis consisting of the vector \(\mathbf{1}\), the columns of \(X_\text{block}\), and the columns of \(X_\text{treat}\). Then:
- \(\mathbf{1}\mu\) describes a 1-dimensional space (a line).
- \(X_\text{block} \theta\) describes a \((B-1)\)-dimensional space.
- \(X_\text{treat} \tau\) describes a \((v - 1)\)-dimensional space.
TODO
The crucial fact is that when the design is balanced, each of the subspaces mentioned above is orthogonal to the others. You will prove this in Exercise 6.9.
6.6.2 F tests
Because of the geometry, we can construct convenient hypothesis tests using the sums of squares.
TODO
6.6.3 Multiple testing
When we conduct an experiment with many treatment levels, there may be many contrasts of interest to us. The \(F\) test for treatment tests an overall null hypothesis, but we may also be interested in specific contrasts comparing particular treatments or combinations of treatments. Testing many of these, however, can be risky.
Suppose we set \(\alpha = 0.05\) for our tests. We conduct \(k\) hypothesis tests. Suppose they are independent, and suppose the null hypothesis is true for every test. Then the probability of getting at least one significant result is one minus the probability of getting exactly zero significant results, so \[ \Pr(\text{at least one of $k$ is significant}) = 1 - (1 - \alpha)^k. \] This is graphed in Figure 6.1, showing that with even a modest number of contrasts, there’s a high risk of false positive results.
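The calculation is simple enough to check directly; a sketch:

alpha <- 0.05
k <- 1:20
fwer <- 1 - (1 - alpha)^k        # chance of at least one false positive among k independent tests
round(fwer[c(1, 5, 10, 20)], 2)  # 0.05, 0.23, 0.40, 0.64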
Now, depending on the contrasts, the tests may not be independent of each other, but that may just make the problem worse. If we are to run many contrasts, we must adjust our tests to keep our overall false positive rate controlled at a reasonable level. This overall rate is often called the “family-wise error rate”, since you are doing a “family” of related tests and want to control the overall error rate. There are several options.
6.6.3.1 Bonferroni correction
The Bonferroni correction is based on Boole’s inequality. If \(\Pr(A_i)\) represents the probability of some event \(A_i\), then \[ \Pr\left( \cup_{i=1}^k A_i \right) \leq \sum_{i=1}^k \Pr(A_i). \] That is: the probability that at least one event occurs is no larger than the sum of their probabilities. This inequality holds whether the events are independent or dependent.
Consequently, \[ \Pr(\text{at least one of $k$ is significant}) \leq k \alpha. \] To ensure the family-wise error rate is no larger than \(\alpha\), then, you can test each contrast at level \(\alpha / k\). Use \(0.05/k\) when testing \(k\) contrasts, for instance, and the FWER will be no larger than 0.05.
This is known as the Bonferroni correction. It can be done either by reducing the significance threshold or by multiplying all \(p\) values in the study by \(k\) and using the original \(\alpha\) as the significance threshold. It can also be applied to confidence intervals to ensure they are joint confidence intervals.
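In R, the multiplication version is most easily applied with p.adjust(), which multiplies each \(p\) value by \(k\) and caps the result at 1; the \(p\) values here are made up.

p <- c(0.004, 0.02, 0.11, 0.43)
p.adjust(p, method = "bonferroni")  # 0.016, 0.080, 0.440, 1.000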
The Bonferroni correction is conservative: It works regardless of the dependence between tests, and so it assumes the worst case. This means it is stricter than it has to be in specific situations, and results in lower power for tests. That’s why we often turn to procedures specifically derived for certain tests, as they can be more powerful while still controlling the FWER.
6.6.3.2 Tukey’s HSD
Sometimes we are only interested in pairwise comparisons between treatments: How does treatment 1 compare to treatment 2? What about treatment 4 versus treatment 7? Tukey proposed the “honestly significant differences” procedure that can produce confidence intervals and tests for all possible pairwise comparisons while still controlling the FWER.
Definition 6.7 (Tukey’s HSD) In a randomized experiment with \(v\) treatments, a \(1-\alpha\) confidence interval for \(\tau_i - \tau_j\) for \(i \neq j\) is \[ (\bar Y_{i \cdot} - \bar Y_{j \cdot}) \pm w_T \sqrt{\hat \sigma^2 \left( \frac{1}{r_i} + \frac{1}{r_j}\right)}, \] where \(r_i\) and \(r_j\) are the numbers of units in each treatment and \[ w_T = \frac{q_{v,n-v,1-\alpha}}{\sqrt{2}}, \] and \(q_{v,n-v,1-\alpha}\) is the \(1-\alpha\) quantile of the Studentized range distribution with \(v\) groups and \(n-v\) degrees of freedom. This confidence interval provides at least \(1 - \alpha\) coverage.
The distribution’s quantiles can be calculated with qtukey() in R.
Proof. For simplicity, consider the case where all treatments have \(r\) experimental units. Then define the statistics \(T_i = \bar Y_{i \cdot} - (\mu + \tau_i)\), which give the difference between the sample mean for treatment \(i\) and the population mean. If the errors are normally distributed, or if we believe the central limit theorem, these statistics will be (perhaps approximately) normally distributed. They are also independent. Then consider the statistic \[ Q = \frac{\max\{ T_i \} - \min \{T_i \}}{\sqrt{\hat \sigma^2 / r}}. \] Here we are taking the difference between the largest and smallest of \(v\) normally distributed random variables, and dividing by an estimate of the standard deviation. One can work out the distribution of this quantity, and it’s called the Studentized range distribution. Consequently, \[ \Pr(Q \leq q_{v, n-v, 1-\alpha}) = 1 - \alpha. \]
Now, how do we apply this to a particular comparison of two treatment means? Any individual difference \(|T_i - T_j|\) is no larger than \(\max\{T_i\} - \min\{T_i\}\), so \[ \Pr\left(-q_{v,n-v,1-\alpha} \sqrt{\hat \sigma^2/r} \leq T_i - T_j \leq q_{v,n-v,1-\alpha} \sqrt{\hat \sigma^2/r}\right) \geq 1 - \alpha, \] simultaneously for every pair \(i \neq j\). Because \(T_i - T_j = (\bar Y_{i \cdot} - \bar Y_{j \cdot}) - (\tau_i - \tau_j)\), rearranging this event bounds \(\tau_i - \tau_j\) within \((\bar Y_{i \cdot} - \bar Y_{j \cdot}) \pm q_{v,n-v,1-\alpha} \sqrt{\hat \sigma^2 / r}\), and since \(q_{v,n-v,1-\alpha} \sqrt{\hat \sigma^2/r} = (q_{v,n-v,1-\alpha}/\sqrt{2}) \sqrt{\hat \sigma^2 (1/r + 1/r)}\), this is exactly the confidence interval given in the definition.
Naturally, the intervals can be inverted to produce hypothesis tests for the pairwise differences, again controlling the family-wise error rate. Confidence intervals from Tukey’s HSD are less conservative (narrower) than Bonferroni intervals, and tests are more powerful.
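As a sketch of the computation behind Definition 6.7 (assuming a balanced experiment with v treatments, r units per treatment, n total units, and an estimate sigma2_hat already in hand); in practice, TukeyHSD() applied to an aov() fit computes all the pairwise intervals and adjusted \(p\) values for you.

w_T <- qtukey(0.95, nmeans = v, df = n - v) / sqrt(2)
half_width <- w_T * sqrt(sigma2_hat * (1 / r + 1 / r))  # add/subtract from each pairwise mean difference

# For an aov() fit with a treatment factor named trt:
# TukeyHSD(fit, "trt")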
6.6.3.3 Scheffé’s method
Scheffé (1953) introduced a method for constructing simultaneous confidence intervals for all possible contrast vectors.
Definition 6.8 (Scheffé’s method) In a randomized experiment with \(v\) treatments and \(r_i\) replicates per treatment, a \(1 - \alpha\) confidence interval for any contrast represented by contrast vector \(c\) is \[ \sum_{i=1}^v c_i \bar Y_{i \cdot} \pm \sqrt{(v - 1) F_{v-1,n-v, 1-\alpha}} \sqrt{\hat \sigma^2 \sum_{i=1}^v \frac{c_i^2}{r_i}}, \] where \(F_{v-1,n-v,1-\alpha}\) is the \(1-\alpha\) quantile of an \(F\) distribution with \(v-1\) and \(n-v\) degrees of freedom.
All such confidence intervals for an experiment will jointly have \(1 - \alpha\) coverage.
Proof. The proof is somewhat involved, but see Klotz (1969) for a simple proof in a special case.
Typically, these confidence intervals are the most conservative: Tukey’s HSD is specialized for only one set of possible contrasts, while Scheffé’s method must account for any possible contrast. The Bonferroni correction is less conservative provided you only do a few contrasts, but of course becomes more and more conservative the more tests or confidence intervals you construct.
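A sketch of a single Scheffé interval, using the same hypothetical ingredients as before (treatment means, replicate counts r, contrast vector cvec, sigma2_hat, v, and n):

T_hat <- sum(cvec * treatment_means)
se_T <- sqrt(sigma2_hat * sum(cvec^2 / r))
scheffe_mult <- sqrt((v - 1) * qf(0.95, v - 1, n - v))  # replaces the t quantile
c(T_hat - scheffe_mult * se_T, T_hat + scheffe_mult * se_T)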
6.7 Estimation in R
Now that we’ve seen the math, let’s consider how to analyze a designed experiment in R. Let’s use a real experiment and review how we fit the linear model, select the contrasts of interest, and produce confidence intervals and tests.
Example 6.4 (Resting metabolic rate) Dean, Voss, and Draguljić (2017) (example 10.4.1) present data collected by Bullough and Melby (1993), who conducted an experiment to compare different methods for measuring resting metabolic rate (RMR). Resting metabolic rate is a person’s rate of energy consumption at rest, and can be determined by carefully tracking the oxygen and carbon dioxide they breathe in and out; as the body consumes energy, it produces carbon dioxide, so the rate of production can be converted into energy use. Metabolic rate measurements are used in various medical and biological studies, but there was disagreement about the best way to measure resting rate: should the experimental subject spend the night in the laboratory, to be measured in the morning, or can they drive into the lab in the morning for the measurement? The extra activity may make them consume more energy, changing the measurements.
The experiment tested three methods:
- Inpatient, meal-controlled: The subject ate a meal provided by the experimenters, then was brought to the lab to stay the night before being measured in the morning.
- Outpatient, meal-controlled: The subject ate a meal provided by the experimenters, spent the night at home, and was driven to the lab in the morning for measurement.
- Outpatient: The subject ate their own meal at home, spent the night at home, and was driven to the lab in the morning for measurement.
The experiment used a within-subject design: There were 9 subjects, and they were measured all 3 ways in random order, on separate days.
The data is provided in resting.metabolic.rate.txt:
<- read.table("data/resting.metabolic.rate.txt",
rmr header = TRUE)
head(rmr)
subject protocol rate
1 1 1 7131
2 1 2 6846
3 1 3 7095
4 2 1 8062
5 2 2 8573
6 2 3 8685
The subject column identifies the subject (1 through 9), the protocol column identifies the method (as numbered above), and rate is the measured resting metabolic rate in kJ/day.
It seems reasonable to think of this as an experiment with \(9 \times 3 = 27\) observations. We’d expect there to be lots of variation between subjects—people of different size, metabolism, age, and so on may have very different metabolic rates—and so it would be reasonable to block on subject. Randomly ordering the measurements within subjects can be thought of as randomly assigning treatments within the blocks.
In principle, in R we can use the lm() function to fit the linear model, and translate between the \(\hat \beta\)s it estimates and the contrasts we’re interested in. But if we use aov() (for analysis of variance), we get the results presented in the standard ANOVA form.
Example 6.5 (ANOVA of Example 6.4) To fit the model and print the ANOVA table, we use aov() and summary() to get a nice table. The model formula is rate ~ subject + protocol: rate is the response, and we list both the treatment factor and blocking variable on the right hand side. But first we must make subject and protocol factor variables, so the numeric coding isn’t mistaken for a continuous variable.
rmr$subject <- factor(rmr$subject)
rmr$protocol <- factor(rmr$protocol)
fit <- aov(rate ~ subject + protocol, data = rmr)
summary(fit)
Df Sum Sq Mean Sq F value Pr(>F)
subject 8 23117462 2889683 37.423 6.3e-09 ***
protocol 2 35949 17974 0.233 0.795
Residuals 16 1235483 77218
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The tests suggest that the blocking factor (subject) is important and explains much of the variance in \(Y\), but the protocol does not. We might report this in text by saying:
The measurement protocol was not significantly associated with the resting metabolic rate measurement (\(F(2, 16) = 0.23\), \(p = 0.79\)).
Notice we’ve presented the \(F\) statistic, both degrees of freedom, and the \(p\) value. Always report results by giving a substantive interpretation, then the test statistic and \(p\) value that justify it.
Because aov() uses lm() underneath, the fit object is a special subclass of an lm fit:
class(fit)
[1] "aov" "lm"
This means it supports all the usual methods you use for linear models: residuals(), predict(), coefficients(), and, most importantly for us, vcov().
Of course, hypothesis tests are not our only goals. We may also want to make confidence intervals for specific contrasts.
Example 6.6 (Contrasts of Example 6.4) There are two ways to handle contrasts in R. In the first approach, we consider testing individual contrasts. We can use lm() to test the default contrasts, which compare each factor level to the first:
summary(lm(rate ~ subject + protocol, data = rmr))
Call:
lm(formula = rate ~ subject + protocol, data = rmr)
Residuals:
Min 1Q Median 3Q Max
-329.0 -191.7 24.3 131.0 514.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6975.04 177.37 39.325 < 2e-16 ***
subject2 1416.00 226.89 6.241 1.18e-05 ***
subject3 89.33 226.89 0.394 0.69898
subject4 400.67 226.89 1.766 0.09648 .
subject5 2061.67 226.89 9.087 1.02e-07 ***
subject6 198.00 226.89 0.873 0.39575
subject7 669.67 226.89 2.952 0.00938 **
subject8 2862.67 226.89 12.617 9.92e-10 ***
subject9 875.67 226.89 3.859 0.00139 **
protocol2 87.56 130.99 0.668 0.51341
protocol3 59.33 130.99 0.453 0.65667
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 277.9 on 16 degrees of freedom
Multiple R-squared: 0.9493, Adjusted R-squared: 0.9177
F-statistic: 29.98 on 10 and 16 DF, p-value: 1.786e-08
The protocol2 and protocol3 rows correspond to the contrasts for \(\tau_2 - \tau_1\) and \(\tau_3 - \tau_1\). We can view the coding R used with the contrasts() function:
contrasts(rmr$protocol)
2 3
1 0 0
2 1 0
3 0 1
There are three rows, corresponding to the three levels. The columns are not contrast vectors in the sense of the coefficients \(c_i\); instead, they show how each factor level is represented in the design matrix \(X\) when R fits the model \(Y = X \beta + \epsilon\). We can see this explicitly using the model.matrix() function to view the \(X\) created for the model fit:
model.matrix(rate ~ protocol, data = head(rmr))
(Intercept) protocol2 protocol3
1 1 0 0
2 1 1 0
3 1 0 1
4 1 0 0
5 1 1 0
6 1 0 1
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$protocol
[1] "contr.treatment"
The protocol2 and protocol3 columns correspond to the columns returned by contrasts(), with the values chosen corresponding to the factor level of each experimental unit. We see that observation 1 was assigned protocol 1, observation 2 was assigned protocol 2, and observation 3 was assigned protocol 3.
We can change the contrasts by assigning to the contrasts() function:
contrasts(rmr$protocol) <- contr.sum(levels(rmr$protocol))
The contr.sum() function produces “sum to zero” contrasts:
contrasts(rmr$protocol)
[,1] [,2]
1 1 0
2 0 1
3 -1 -1
Notice the columns here are explicitly constructed to sum to zero. If your experiment has equal numbers of units at each protocol level, the corresponding columns of the design matrix will be orthogonal to the intercept column (a column of 1s).
If we fit the model again, the coefficients and tests provided correspond to the two new contrasts:
summary(lm(rate ~ subject + protocol, data = rmr))
Call:
lm(formula = rate ~ subject + protocol, data = rmr)
Residuals:
Min 1Q Median 3Q Max
-329.0 -191.7 24.3 131.0 514.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7024.00 160.43 43.781 < 2e-16 ***
subject2 1416.00 226.89 6.241 1.18e-05 ***
subject3 89.33 226.89 0.394 0.69898
subject4 400.67 226.89 1.766 0.09648 .
subject5 2061.67 226.89 9.087 1.02e-07 ***
subject6 198.00 226.89 0.873 0.39575
subject7 669.67 226.89 2.952 0.00938 **
subject8 2862.67 226.89 12.617 9.92e-10 ***
subject9 875.67 226.89 3.859 0.00139 **
protocol1 -48.96 75.63 -0.647 0.52655
protocol2 38.59 75.63 0.510 0.61682
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 277.9 on 16 degrees of freedom
Multiple R-squared: 0.9493, Adjusted R-squared: 0.9177
F-statistic: 29.98 on 10 and 16 DF, p-value: 1.786e-08
In the second approach, we can use the emmeans package, which supports multiple comparison procedures for contrasts. For example, the default pairwise contrast uses Tukey’s adjustment of the \(p\) values for all three pairwise contrasts.
library(emmeans)
fit_emm <- emmeans(fit, "protocol")
contrast(fit_emm, method = "pairwise")
contrast estimate SE df t.ratio p.value
protocol1 - protocol2 -87.6 131 16 -0.668 0.7848
protocol1 - protocol3 -59.3 131 16 -0.453 0.8938
protocol2 - protocol3 28.2 131 16 0.215 0.9748
Results are averaged over the levels of: subject
P value adjustment: tukey method for comparing a family of 3 estimates
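The pairwise set is not the only option: contrast() also accepts a named list of contrast coefficient vectors, applied to the protocol means in the order of the factor levels. For example, here is a sketch of a custom contrast comparing the inpatient protocol to the average of the two outpatient protocols (a comparison we chose for illustration, not one from the original analysis):

contrast(fit_emm, method = list("inpatient vs. outpatient" = c(1, -1/2, -1/2)))
# confint() on the result gives the corresponding confidence interval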
6.8 Exercises
Exercise 6.9 (Geometry of experiments) You may know that linear regression can be thought of geometrically: in a linear model \(Y = X \beta + \epsilon\), with \(X\) an \(n \times p\) matrix of covariates, the least squares solution amounts to a projection of \(Y\) onto the column space of \(X\). That projection is \(\hat Y\), and \(\hat \beta\) gives its coordinates in the column space of \(X\).
Consider a balanced block design with a binary treatment and a binary blocking variable, so the linear model is \[ Y_{hit} = \mu + \theta_h + \tau_i + \epsilon_{hit}. \] There is no interaction, and there are an equal number of replicates in each block-treatment combination.
- This model can be rewritten in \(Y = X \beta + \epsilon\) form by appropriately defining the columns of \(X\). State how many columns \(X\) will have and give a working definition for each column.
- Because there are only a few possible combinations of blocks and treatments, the design matrix is \(r\) repeats of a few rows. Write out these rows.
- Your design matrix \(X\) should contain a column for the blocks and a column for the treatment. These columns are correlated with the intercept/grand mean column. But we can make them orthogonal by using the Gram–Schmidt process. If you have two vectors \(u\) and \(v\), you can make them orthogonal by projecting one onto the other, and subtracting out the projection: \[ u' = u - \frac{u\T v}{v\T v} v. \] Use Gram–Schmidt to make the block and treatment columns orthogonal to the intercept/grand mean column. Write out the new design matrix.
- Are the new columns orthogonal to each other? Calculate their inner products.
- If the blocks were not balanced, would your result change? Why or why not?
Exercise 6.10 (Blocking in R) Return to the sim-crd.csv data file we introduced in Exercise 4.8.
- Use your completely_randomized() and block_randomized() functions (from Exercise 4.6 and Exercise 5.3) to generate two datasets: one representing a completely randomized experiment on the data, and one representing a block-randomized experiment, blocking on prior_grade. Report:
  - The estimated treatment effect in the completely randomized experiment, using Definition 4.2.
  - The estimated treatment effect in the block randomized experiment, using Definition 5.3.
  - The number of blocks in the blocked experiment and the number of units in each treatment in each block. Is this a balanced block design?
- Use aov() to fit two linear models: one to the completely randomized design, ignoring prior_grade, and one to the blocked design, using prior_grade as a variable in the analysis. Print the two ANOVA tables. Comment on the sums of squares:
  - Does prior_grade appear to account for much of the variance in the response?
  - Does the assignment row change much between the two fits? In particular, consider the sum of squares and the \(F\) statistic: which do you expect to be different and why?
- Repeat the fits using lm() and examine the model summary. We’re particularly interested in the effect of assignment. What is different in the estimates and standard errors between the two fits? Write a sentence interpreting the effect in the blocked model fit. Include a 95% confidence interval.
Exercise 6.11 (Defining contrasts) TODO extend Exercise 6.10 to ask for R code to set contrasts for desired quantities
Exercise 6.12 (Balanced block designs) In 2017, researchers discovered security vulnerabilities affecting every computer using common processor architectures (x86 in particular), in which standard processor features could be abused by attackers to get access to data that the operating system would not otherwise grant them access to. (Actually there were several related vulnerabilities, going under the names Meltdown and Spectre.)
Suppose you are an operating system manufacturer testing ways of protecting against the security vulnerabilities. You have two different possible mitigation strategies which you want to compare against an unprotected version. You’re worried that the mitigations may make software running on computers slower, and want to test various programs to see how fast they run with and without mitigations.
Different programs may be affected differently by the mitigations, so you decide to test five different programs under each of the three conditions: two different mitigations and the unprotected control version. You time how long each program takes to run. This time varies for various reasons, so you need to run each several times. But the programs are slow, and you only have time to make 45 total runs.
Describe the design you would use for this experiment. What are the blocks, what are the treatment groups, and how would you handle random assignment to ensure balance?
- Let’s simulate some data. Below is the R function you should use to generate the data. It takes mitigation and program arguments and returns the number of minutes taken to run that program with that mitigation, in one particular experimental run.

  # Simulate the run time of a mitigation for a particular program.
  # There are three mitigations:
  #   0: no mitigation
  #   1: one experimental mitigation strategy
  #   2: the other experimental strategy
  # There are five different programs, numbered 1 through 5.
  run_time <- function(mitigation, program) {
    40 + 5 * mitigation + c(-8, 6, 14, -16, 4)[program] + rnorm(1, sd = 5)
  }

  Simulate data following the design you chose in part 1, recording the results into a data frame. Remember the constraint that you can only make 45 total runs, and remember to store the program and mitigation as factors, not numbers.

  Note: Put your simulation code in a function so you can call it many times to get many simulated data frames. We’ll use this later in the problem.

- Use lm() to estimate the effect of mitigation on runtime, ignoring the program variable entirely. Examine the standard errors reported by summary(). Now run a second model where you do include the program variable. What happens to the estimates and standard errors of the effect of the mitigations? (The summary() function returns a list including a coefficients matrix containing the standard errors, if that’s helpful to you.)
- Use the vcov() function to get the covariance matrix of the estimates. What do you notice about the covariance between the mitigation and program parameters?
- Now let’s examine an unbalanced design. Generate a data frame like you did in part 2, with the same number of rows, but this time do not enforce balance. Instead, for each observation, randomly pick a program and a mitigation. Fit the same models as before and examine the estimates, standard errors, and covariances again. What do you observe?
- Of course, it’s hard to tell if differences in the standard errors are from this sample or are generic. Run 1,000 simulations from the balanced and unbalanced designs. For each simulation, fit the model with both mitigation and program factors, and record the standard errors on the mitigation coefficients. (Don’t include the intercept.) Report or plot your results. Calculate the average percent increase in standard error resulting from the unbalanced design.
Exercise 6.13 (When is blocking beneficial?) In Example 6.3, we considered having a binary treatment \(Z_i \in \{0, 1\}\) and a binary blocking variable \(X_i \in \{0, 1\}\). We can describe all possible cell means with the model \[ \E[Y_i \mid X_i, Z_i] = \beta_0 + \beta_1 Z_i + \beta_2 X_i + \beta_3 X_i Z_i. \] This corresponds to a table of cell means:
|  | \(Z_i = 0\) | \(Z_i = 1\) |
|---|---|---|
| \(X_i = 0\) | \(\tau_{00} = \beta_0\) | \(\tau_{01} = \beta_0 + \beta_1\) |
| \(X_i = 1\) | \(\tau_{10} = \beta_0 + \beta_2\) | \(\tau_{11} = \beta_0 + \beta_1 + \beta_2 + \beta_3\) |
Suppose we conduct an experiment with \(r\) replicates in each cell, so there are \(n = 4r\) total experimental units. There are two reasonable estimators for the average treatment effect of \(Z\):
- Ignore blocking. Estimate \(\hat \tau_\text{noblock} = \bar Y_{\cdot 1} - \bar Y_{\cdot 0}\), where \(\bar Y_{\cdot 1}\) represents the average \(Y\) for the \(2r\) replicates with \(Z_i = 1\).
- Use blocking. Estimate \(\hat \tau_\text{block}\) using the blocked mean difference estimator, Definition 5.3.
These correspond to fitting linear models with only a \(Z\) term (\(\hat \tau_\text{noblock}\)) or with both \(X\) and \(Z\) (\(\hat \tau_\text{block}\)).
Assume that the variance of \(Y\) within each cell is \(\var(Y \mid X, Z) = \sigma^2\). Note that omitting \(X\) in the unblocked model means the variation due to \(X\) is absorbed in the error term—so the variance of the error in the unblocked model is not \(\sigma^2\).
- Derive a formula for \(\hat \tau_\text{block}\) in terms of \(\bar Y\).
- Calculate \(\var(Y \mid Z = 0)\) and \(\var(Y \mid Z = 1)\) in terms of the \(\beta\) parameters. Notice that \(\var(Y \mid Z) \neq \var(Y \mid X, Z)\). (Hint: Recall the law of total variance, and recall we have \(r\) replicates in each cell.) What does this imply about the variance of the errors in the unblocked model?
- Calculate \(\var(\hat \tau_\text{noblock})\) and \(\var(\hat \tau_\text{block})\).
- Consider \(\beta_2\) and \(\beta_3\), which together describe the influence of the blocking variable \(X\). Under what condition does \(\var(\hat \tau_\text{noblock}) = \var(\hat \tau_\text{block})\)? Which of \(\beta_2\) and \(\beta_3\) has the strongest effect on the variances?
Exercise 6.14 (Respiratory exchange ratio) In Example 6.4, we introduced the resting metabolic rate experiment. The experimenters also measured the respiratory exchange ratio of the subjects, which is the ratio between their production of carbon dioxide and consumption of oxygen. This gives an indication of which sources of energy in the body are being used, and is used in various exercise and stress tests.
The data is given in respiratory.exchange.ratio.txt. Load the data using read.table():

rer <- read.table("data/respiratory.exchange.ratio.txt",
                  header = TRUE)

Because the data was collected as part of the same experiment, it has the same block design as in Example 6.4.
- Use aov() to model respiratory exchange ratio, being sure to include the blocking factor. Print the ANOVA table. Comment on the sums of squares: does the blocking factor account for a lot of the variance?
- Our inferences are based on the assumption that either (a) the errors are normally distributed or (b) the sample size is large enough that the central limit theorem ensures parameter estimates will be approximately normal. Evaluate option (a) by making a Q-Q plot of the residuals \(\hat \epsilon_{ij}\). (You can use the residuals() function to obtain these.)
- Report the results of an \(F\) test for the null hypothesis that all measurement protocols have the same mean.
- Calculate all pairwise contrasts using the Tukey HSD method. Does your result match the result from part 3?