8  Power

In most experiments, our goal is to estimate the effect of one or more treatments—either by testing whether the treatment effects are zero or by obtaining confidence intervals for their sizes. Before we conduct an experiment, then, we must ask: How many experimental units do we require?

This question can be phrased in two ways:

  1. How many experimental units do we need to reject the null hypothesis of no treatment effect with high probability, provided there really is a treatment effect?
  2. How many experimental units do we need to estimate the treatment effect with a confidence interval of a desired width?

A great deal of effort has been expended on the first question, though in my view, the second question is more important. Confidence intervals are more useful than hypothesis tests in general; failing to reject a null hypothesis is much more informative when you know the precision of the estimate.

Nonetheless, let’s consider the first question first.

8.1 Power of F tests

8.1.1 One-way experiments

Suppose your research question can be answered with a partial \(F\) test. For instance, in a simple experiment with one treatment factor with three levels, you might be interested in the null hypothesis \[ H_0: \tau_1 = \tau_2 = \tau_3, \] reflecting all treatment means being the same. If treatment 1 is a control and treatments 2 and 3 are new treatments, this null represents the new treatments working no better than control.

A partial \(F\) test for this null would amount to fitting the model with the factor and the model with only an intercept, then calculating the \(F\) statistic to compare them. The standard result is that under \(H_0\), the \(F\) statistic follows an \(F_{q,n-p}\) distribution, where \(q\) is the difference in the number of parameters between the models, \(p\) is the number of parameters in the larger model, and \(n\) is the number of experimental units.
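For concreteness, here is a minimal sketch of that comparison in R using made-up data; the three group means (10, 11, 12) and the error standard deviation of 2 are arbitrary choices for illustration.

set.seed(1)
treatment <- factor(rep(c("control", "new1", "new2"), each = 4))
y <- rnorm(12, mean = c(10, 11, 12)[as.integer(treatment)], sd = 2)
null_fit <- lm(y ~ 1)          # intercept-only model
full_fit <- lm(y ~ treatment)  # model including the treatment factor
anova(null_fit, full_fit)      # partial F test comparing the two models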

It turns out we can also obtain the \(F\) statistic’s distribution under an alternative hypothesis.

Theorem 8.1 (Alternative distribution for the partial F test) In a partial \(F\) test for a treatment factor with \(v\) levels, each with \(r_i\) replicates, the \(F\) statistic has distribution \[ F_{v-1, n-v, \delta^2}, \] where this represents the noncentral \(F\) distribution, and \[\begin{align*} \delta^2 &= \frac{(v - 1) Q(\tau)}{\sigma^2}\\ Q(\tau) &= \frac{\sum_{i=1}^v r_i \left(\tau_i - \frac{1}{n}\sum_{h=1}^v r_h \tau_h\right)^2}{v-1}. \end{align*}\] We can think of \(Q(\tau)\) as a (replication-weighted) variance of the treatment means. \(\delta^2\) is the noncentrality parameter, which, together with the two degrees-of-freedom parameters, determines the shape of the noncentral \(F\) distribution.
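The quantities in Theorem 8.1 are straightforward to compute directly. Here is a small helper (the function name is my own, not a standard one) that takes hypothesized values of \(\tau\), the per-treatment replicate counts, and an assumed \(\sigma^2\), and returns \(\delta^2\):

noncentrality <- function(tau, reps, sigma2) {
  n <- sum(reps)
  v <- length(tau)
  tau_bar <- sum(reps * tau) / n # replication-weighted mean of the taus
  Q <- sum(reps * (tau - tau_bar)^2) / (v - 1)
  (v - 1) * Q / sigma2 # delta^2
}

For instance, noncentrality(c(0, 1), reps = c(4, 4), sigma2 = 2) gives \(\delta^2 = 1\), matching the calculation in Example 8.1 below.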

We can use Theorem 8.1 to calculate the power of the \(F\) test under a particular alternative hypothesis. To do so, we calculate the critical value of the null \(F\) distribution. We then calculate the probability of obtaining an \(F\) statistic equal to or larger than the critical value under the alternative.

Example 8.1 (Power in a simple binary experiment) Consider a simple experiment with \(v = 2\) treatments, each replicated \(r = 4\) times. We can calculate the critical value of the \(F\) statistic under the null:

v <- 2 # treatments
r <- 4 # replicates
n <- r * v # total sample size
alpha <- 0.05 # desired alpha level

f_crit <- qf(1 - alpha, v - 1, n - v)
f_crit
[1] 5.987378

Now we consider the alternative hypothesis where \(\tau_2 - \tau_1 = 1\). We might choose this particular alternative because we believe a difference of 1, in whatever units the response is measured in, is the smallest meaningful or important difference; if the true difference is smaller, we don’t care about detecting it.

Now we find the probability of obtaining an \(F\) statistic equal to or more extreme than the critical value under the alternative hypothesis, using the noncentral \(F\). We will need \(Q(\tau)\), \(\delta^2\), and a value for \(\sigma^2\); here we take \(\sigma^2 = 2\). Because \(r_i = r = 4\) for all \(i\), each \(\tau_i\) differs from the overall mean of the \(\tau\)s by \(1/2\), so \(Q(\tau)\) simplifies to \(n (1/2)^2 / (v - 1)\):

sigma2 <- 2 # assumed error variance
Q <- n * (1/2)**2 / (v - 1) # Q(tau): each tau_i is 1/2 from the overall mean
delta2 <- (v - 1) * Q / sigma2 # noncentrality parameter

To find the power, i.e. the probability of obtaining an \(F\) statistic larger than f_crit, we use pf with the ncp argument to set the noncentrality parameter:

power <- pf(f_crit, v - 1, n - v, ncp = delta2, lower.tail = FALSE)
power
[1] 0.1356114

Evidently we have only 14% power in this scenario. If we double the number of replicates, we get:

r <- 8 # doubled replicates
n <- r * v
f_crit <- qf(1 - alpha, v - 1, n - v)
Q <- n * (1/2)**2 / (v - 1)
delta2 <- (v - 1) * Q / sigma2
pf(f_crit, v - 1, n - v, ncp = delta2, lower.tail = FALSE)
[1] 0.2610284

We can also visualize the power in Figure 8.1. Here we’ve plotted the density of the null and alternative \(F\) distributions, along with the critical value. We can see that despite the null hypothesis being false, small \(F\) statistics still have high density. This experiment is underpowered.

Figure 8.1: \(F\) statistic distribution under the null hypothesis (black) and alternative hypothesis (red). The area under the black curve to the right of the dashed line is 5%; the same area under the red curve is the power.

Naturally, we can choose the necessary sample size for our experiment by setting a desired power level and then finding the \(r\) necessary to achieve that power. There is no uniform rule for the “right” power of an experiment. If you really care about whether a treatment effect exists, you should insist on a power near 100%; if you must conduct many experiments cheaply and are willing to accept missing true signals some of the time, you might accept lower power. The most commonly used power threshold is 80%, probably because it’s a nice round number that someone proposed early in the literature and everyone else copied.
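As a sketch of that workflow, the code below reuses the two-treatment setup of Example 8.1 (a true difference of 1 and \(\sigma^2 = 2\)) and computes the power for a range of replicate counts; the 80% target is simply the conventional threshold mentioned above.

alpha <- 0.05
sigma2 <- 2
v <- 2
power_for_r <- function(r) {
  n <- r * v
  delta2 <- n * (1/2)^2 / sigma2 # (v - 1) * Q / sigma^2 for this alternative
  f_crit <- qf(1 - alpha, v - 1, n - v)
  pf(f_crit, v - 1, n - v, ncp = delta2, lower.tail = FALSE)
}
rs <- 2:40
powers <- sapply(rs, power_for_r)
min(rs[powers >= 0.8]) # smallest r achieving at least 80% power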

Also, the procedure above requires us to specify the value of every \(\tau\) under the alternative. For a treatment factor with many levels, this is inconvenient. If I’m interested in testing whether any treatment differs from the others by a certain amount, do I need to consider every possible configuration of \(\tau\) that includes a difference of that amount?

Fortunately we can rearrange Theorem 8.1 to handle the worst-case alternative hypothesis.

Theorem 8.2 (Worst-case alternative distribution for the \(F\) test) Consider a one-way experiment with \(v\) levels. There are many possible alternative hypotheses in which the largest difference between treatments is \(\Delta\): \[ \Delta = \max_{ij} |\tau_i - \tau_j|. \] When all treatments have \(r\) replicates, the alternative with maximum difference \(\Delta\) that has the lowest power is the one in which

  • all but two treatments have the same mean
  • one treatment has a mean \(\Delta/2\) below that common mean
  • one treatment has a mean \(\Delta/2\) above it.

In this worst case, we can use Theorem 8.1 with \[ \delta^2 = \frac{r \Delta^2}{2 \sigma^2} \] to calculate the power.

By calculating the power in the worst possible case, we know that in other alternatives with maximum difference \(\Delta\), the power would be higher. We can choose our sample size to attain power in the worst case and be confident of having at least that much power in any alternative we care about.
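As an illustration, the worst-case \(\delta^2\) plugs directly into the same pf calculation; the particular values of \(v\), \(r\), \(\Delta\), and \(\sigma^2\) below are arbitrary.

v <- 5       # treatments
r <- 10      # replicates per treatment
Delta <- 1   # largest treatment difference we care to detect
sigma2 <- 2  # assumed error variance
n <- r * v
delta2 <- r * Delta^2 / (2 * sigma2) # worst-case noncentrality from Theorem 8.2
f_crit <- qf(0.95, v - 1, n - v)
pf(f_crit, v - 1, n - v, ncp = delta2, lower.tail = FALSE)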

8.1.2 More complex designs

Similar power calculations can easily be extended to designs where the desired hypothesis can be tested with an \(F\) test, and the factor of interest is orthogonal to the blocks or other treatment factors.

Recall that when the treatments and blocks are orthogonal to each other, we can fit the factors completely separately, rather than needing one model with all factors. (That was the major appeal of orthogonal designs pre-computers.) In other words, we can analyze a blocked or multifactor design as a one-way design, after we subtract out the other factors or blocks.

That subtraction does not change the treatment mean estimates. It only affects the estimate \(\hat \sigma^2\) of the error variance and the degrees of freedom available to estimate it, which is what the power calculation must account for.
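For example, assuming a randomized complete block design in which each of \(b\) blocks contains every treatment once, each treatment has \(r = b\) replicates and the error degrees of freedom become \((b - 1)(v - 1)\). A sketch of the adjusted worst-case calculation, with illustrative numbers:

v <- 3       # treatments
b <- 8       # blocks, so r = b replicates per treatment
Delta <- 1   # worst-case maximum difference, as in Theorem 8.2
sigma2 <- 2  # assumed error variance after accounting for blocks
df_error <- (b - 1) * (v - 1)        # error df once blocks are fitted
delta2 <- b * Delta^2 / (2 * sigma2) # same noncentrality formula, with r = b
f_crit <- qf(0.95, v - 1, df_error)
pf(f_crit, v - 1, df_error, ncp = delta2, lower.tail = FALSE)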

8.1.3 Simulation

An alternative way to calculate power is to simulate data from the experiment: draw the appropriate number of samples, run the desired analysis, record the test result, and repeat many times. The fraction of simulations that yield a significant result is the power.

Simulation has the advantage that it can be used for any alternative hypothesis we can simulate from, even if we can’t do the math to derive the distribution under the alternative. If we have a design with non-orthogonal blocks, expect the errors to be particularly non-normal, or have unusual imbalances, it may not be easy to derive the distribution of the test statistic analytically, but we can certainly simulate it.
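As a minimal sketch, here is a simulation for the two-treatment setting of Example 8.1 with \(r = 8\); the number of simulated datasets and the use of lm and anova are my choices for illustration.

set.seed(42)
r <- 8          # replicates per treatment
tau <- c(0, 1)  # treatment effects under the alternative
sigma <- sqrt(2)
alpha <- 0.05
n_sim <- 2000
group <- factor(rep(c("A", "B"), each = r))
p_values <- replicate(n_sim, {
  y <- tau[as.integer(group)] + rnorm(length(group), sd = sigma)
  anova(lm(y ~ group))[["Pr(>F)"]][1] # p-value for the treatment F test
})
mean(p_values < alpha) # estimated power: fraction of significant results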

8.2 Precision

TODO how to guarantee a particular CI width

8.3 Exercises

Exercise 8.1 (Power of an F test) Consider an experiment with \(v = 4\) treatments. Suppose our alternative hypothesis is \[ H_A: \tau_1 = \tau_2 = \tau_3 = \tau_4 + 4, \] so one treatment group differs by 4. Based on a pilot study, we believe \(\sigma^2 \approx 8\).

Calculate the power of the test for different sample sizes, from \(r = 2\) to \(r = 10\), assuming \(\alpha = 0.05\). What \(r\) is necessary to achieve at least 80% power?