10  Early Stopping

So far, we have considered experiments where we choose \(n\), the number of experimental units, and then run the experiment. We obtain all \(n\) results in one batch and proceed to the analysis.

But many experiments don’t work this way. In most, the data is collected over time, and results appear gradually as trials are completed. For example, in a clinical trial of a medication, we do not recruit all the patients on the same day. Instead, the study begins enrolling patients on a particular day, and then accepts new patients who appear over months or years; some patients will be done with the treatment long before other patients even enroll.

This is practical, because there may not be \(n\) patients with the particular condition you’re studying all present at the hospital on the same day. You might only accumulate \(n\) cases over several years. It spreads out the work and ensures your medical staff aren’t trying to treat \(n\) people simultaneously.

But it also creates temptation. Can’t we peek at the early data to see how the experiment is going? If we’re lucky, we’ll have clear results long before we have all \(n\) experimental units.

Similarly, if I’m running a trial of two versions of a website, new users appear gradually at their own pace. I can assign each new user to a treatment as they appear, and I will gradually get the sample size I need for my experiment.

In either case, there are good reasons to want to peek at the results early:

  1. Each observation may be expensive. My experimental medical treatment might cost hundreds of thousands of dollars per dose, so if I can stop the trial early, I save lots of money.
  2. Time is money. If I can stop early, I can move on to a new experiment sooner.
  3. If the early results are strong, it may not be ethical to continue the experiment: If my new experimental treatment cures 100% of patients, and the old standard treatment only cures 50% of patients, then once I’m sure the new treatment works, I should give it to everyone. It seems silly to keep assigning patients to the old treatment because my design says I need to. And if I’m trying two versions of a website, and one clearly makes more money than the other, there may be a good business reason to stop early and use the more profitable one.

That suggests a simple procedure: As soon as there is enough data to conduct the desired analysis (i.e. there’s at least one observation per treatment group and block, and so on), run the analysis. If the results are inconclusive, continue the study. If the results show the desired effect, stop the experiment. But this procedure has some problems.
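To see the problem concretely, here is a minimal simulation sketch in Python (the sample size, number of simulations, and significance level are arbitrary choices, not from any particular study): we test \(H_0: \mu = 0\) after every new observation at level \(\alpha = 0.05\) and stop at the first significant result. Even though the null is true, the procedure rejects far more often than 5% of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_max = 200    # maximum sample size (hypothetical choice)
n_sims = 2000  # number of simulated experiments

def peeking_rejects(rng):
    """Run one experiment under H0: mu = 0, peeking after every observation."""
    x = rng.normal(0.0, 1.0, n_max)        # sigma = 1 known, H0 true
    crit = stats.norm.ppf(1 - alpha / 2)   # two-sided critical value
    for n in range(2, n_max + 1):
        z = np.sqrt(n) * x[:n].mean()      # one-sample z statistic
        if abs(z) > crit:
            return True                    # "significant" -- stop early
    return False

rejections = sum(peeking_rejects(rng) for _ in range(n_sims))
print(f"Type I error with peeking: {rejections / n_sims:.2f} (nominal {alpha})")
# Typically prints a rate several times larger than 0.05.
```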

10.1 Double-dipping

TODO

10.2 Sequential designs

Evidently we need test procedures that account for the sequential nature of the testing. One option comes from likelihood ratio tests.

Definition 10.1 (Likelihood ratio test) Consider obtaining samples \(X_1, \dots, X_n \sim f\), where \(f\) is a distribution parameterized by \(\theta\). We would like to test \(H_0: \theta \in \Theta_0\) against the alternative \(H_A: \theta \in \Theta_0^c\). The likelihood ratio test statistic is \[ \lambda(X) = \frac{\sup_{\Theta_0} L(\theta \mid X_1, \dots, X_n)}{\sup_\Theta L(\theta \mid X_1, \dots, X_n)}, \] where \(L(\theta \mid X_1, \dots, X_n)\) is the likelihood of the data under \(f\) with parameter \(\theta\).

The likelihood ratio test has a rejection region of the form \(\{X : \lambda(X) \leq c\}\), where \(0 \leq c \leq 1\). Generally \(c\) is chosen to obtain the desired \(\alpha\)-level.

You’ve likely used likelihood ratio tests before in your mathematical statistics courses. We can apply them to simple experimental settings.

Example 10.1 (Likelihood ratio tests in experiments) Consider a trial with two treatment groups, denoted by \(X_i \in \{0, 1\}\). We observe \(Y\) and wish to test whether its mean is different between the two groups.

We can model \[ Y \sim \begin{cases} \text{Normal}(\mu_0, \sigma^2) & X_i = 0\\ \text{Normal}(\mu_1, \sigma^2) & X_i = 1 \end{cases} \] The parameter vector for the LRT is hence \(\theta = (\mu_0, \mu_1)\). The null hypothesis is \(H_0: \theta \in \Theta_0\), where \(\Theta_0 = \{(\mu_0, \mu_1) : \mu_0 = \mu_1\}\).
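If we take \(\sigma^2\) as known, the LRT in this example can be computed in closed form: \(-2 \log \lambda(X)\) reduces to \((\bar Y_1 - \bar Y_0)^2 / (\sigma^2 (1/n_0 + 1/n_1))\), the square of the usual two-sample \(z\) statistic. A quick numerical check (a Python sketch with arbitrary group sizes and means):

```python
import numpy as np

def neg2_log_lambda(y0, y1, sigma):
    """-2 log(likelihood ratio) for H0: mu0 = mu1, normal data, known sigma."""
    n0, n1 = len(y0), len(y1)
    grand = (y0.sum() + y1.sum()) / (n0 + n1)  # MLE of the common mean under H0
    # Restricted minus unrestricted residual sums of squares
    rss0 = ((y0 - grand) ** 2).sum() + ((y1 - grand) ** 2).sum()
    rss1 = ((y0 - y0.mean()) ** 2).sum() + ((y1 - y1.mean()) ** 2).sum()
    return (rss0 - rss1) / sigma**2

rng = np.random.default_rng(1)
y0 = rng.normal(0.0, 1.0, 30)   # control group
y1 = rng.normal(0.5, 1.0, 40)   # treatment group
sigma = 1.0

z = (y1.mean() - y0.mean()) / (sigma * np.sqrt(1/30 + 1/40))
print(neg2_log_lambda(y0, y1, sigma), z**2)   # the two agree
```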

Likelihood ratio tests have all kinds of desirable properties, which you undoubtedly explored in your mathematical statistics courses. They lead to two possible decisions: reject \(H_0\) or fail to reject \(H_0\).

We can extend likelihood ratio tests to produce sequential likelihood ratio tests. These allow three possible decisions: reject \(H_0\), collect more data, or fail to reject \(H_0\).

Definition 10.2 (Sequential likelihood ratio test) Consider obtaining a sequence of samples \(X_1, X_2, \dots, X_n \sim f_n\), where \(f_n\) is a joint distribution. We would like to test \(H_0: f_n = f_{0n}\), for all \(n\), against the alternative \(H_A: f_n = f_{An}\) for all \(n\). The likelihood ratio statistic is \[ \lambda(X) = \frac{f_{An}(X_1, \dots, X_n)}{f_{0n}(X_1, \dots, X_n)}, \] the ratio of densities/likelihoods.

We choose constants \(0 < A < B < \infty\). At each \(n\), our decision is \[ \text{decision} = \begin{cases} \text{reject $H_0$} & \lambda(X) \geq B\\ \text{obtain new sample} & A < \lambda(X) < B\\ \text{fail to reject $H_0$} & \lambda(X) \leq A \end{cases} \] We hence increase \(n\) until we either reject or fail to reject. Denote the final sample size \(N\).

(Siegmund 1985, sec. 2.1)

This test was first derived by Wald in the 1940s, and allows us to conduct the test after every observation. Each time, we either reject, fail to reject, or proceed to collect the next sample. The trick is determining the appropriate rejection region by setting \(A\) and \(B\), although these can be directly calculated for some cases. You can imagine conducting a more complicated version of this where the observations come from a designed experiment.
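For a simple-vs-simple test, Wald showed the thresholds are well approximated by \(A \approx \beta / (1 - \alpha)\) and \(B \approx (1 - \beta) / \alpha\), where \(\alpha\) and \(\beta\) are the desired Type I and Type II error rates. Here is a minimal sketch of the resulting procedure for normal data with known variance (the hypotheses and error rates below are arbitrary choices):

```python
import numpy as np

def sprt_normal(xs, mu0, mu1, sigma, alpha=0.05, beta=0.10):
    """Wald's SPRT for H0: mu = mu0 vs H_A: mu = mu1, with sigma known.

    Returns (decision, n), where n is the number of observations used.
    """
    log_A = np.log(beta / (1 - alpha))    # fail-to-reject boundary, log scale
    log_B = np.log((1 - beta) / alpha)    # rejection boundary, log scale
    log_lr, n = 0.0, 0
    for x in xs:
        n += 1
        # log f_A(x) - log f_0(x) for one normal observation
        log_lr += ((x - mu0) ** 2 - (x - mu1) ** 2) / (2 * sigma ** 2)
        if log_lr >= log_B:
            return "reject H0", n
        if log_lr <= log_A:
            return "fail to reject H0", n
    return "undecided", n   # ran out of data before crossing a boundary

rng = np.random.default_rng(2)
print(sprt_normal(rng.normal(1.0, 1.0, 1000), mu0=0.0, mu1=1.0, sigma=1.0))
```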

This test has some nice properties. Consider \(\E_A[N]\), the expected final sample size under the alternative hypothesis, and \(\E_0[N]\), the expected final sample size under the null. One can show that this test minimizes both expected sample sizes among all sequential tests with the same size and power, although “a complete proof is circuitous and difficult” (Siegmund 1985, 9).

Sequential designs hence get use in certain niche applications where we can continuously analyze the data and immediately terminate the experiment. But they are not practical for many experiments:

  1. We may not get all the data immediately after each unit is enrolled in the study. In a clinical trial, it may take months before we find out if the patient was cured; must we wait months between enrolling each patient?
  2. We may not be able to easily run the analysis after each observation. Perhaps it is computationally intensive, or perhaps we are running a double-blind clinical trial out of many hospitals, and so we (the leaders of the study) do not receive data labeled with the treatment immediately.
  3. It may not be possible to stop the trial at literally any time.

Consequently, it’s more common to design sequential experiments that let us enroll batches of experimental units. We enroll a batch, wait for their results, and then analyze the data to determine if more batches are necessary. Such a design is called a group sequential design.

10.3 Group sequential designs

Group sequential designs allow groups of experimental units to be enrolled in the study. As their results arrive, we can test for the presence of a treatment effect, and terminate the experiment early if we detect an effect.

Group sequential designs have become popular since the 1980s for clinical trials of new medications and surgeries, as they’re well-suited to the practical restrictions of large clinical trials. Large trials often have binary treatments (new treatment vs. old treatment or placebo) and enroll patients as they are found. Complicated blocking strategies are not used, because defining the blocks and allocating treatments accordingly would require having all patients at the beginning of the study. Simple block designs are often used, but they require special care: the experimenter may know how many patients have been allocated to each treatment and block so far, and hence know which treatment the next patient in a block will receive, breaking blinding. (TODO is this an accurate depiction of trial design?)

Consequently, most sequential design literature is on simple binary experiments without blocking and other fancy features.

For simplicity, let’s first consider a one-sample problem: We have a sequence of observations \(X_1, X_2, \dots\) from a distribution with unknown mean \(\mu\) and wish to test \[ H_0: \mu = \mu_0. \] How do we conduct such a hypothesis test in a group sequential fashion?

10.3.1 Setting critical values

In a group sequential design, we choose in advance \(K\), the maximum number of groups to enroll. The groups have sizes \(n_1, n_2, \dots, n_K\), so the maximum sample size is \(N = \sum_{k=1}^K n_k\).

Following Wassmer and Brannath (2016), section 1.2, let’s index the observations by group. We observe, in sequence, \[ \underbrace{X_{11}, \dots, X_{1n_1}}_\text{group 1}, \underbrace{X_{21}, \dots, X_{2n_2}}_\text{group 2}, \dots, \underbrace{X_{K1}, \dots, X_{K n_K}}_\text{group $K$}. \] After each group of units is observed, we can calculate its group mean, \[ \bar X_k = \frac{1}{n_k} \sum_{i=1}^{n_k} X_{ki}, \] and we can calculate the cumulative mean of all observations up to this stage: \[ \bar X^{(k)} = \frac{1}{\sum_{j=1}^k n_j} \sum_{j=1}^k n_j \bar X_j. \]

Now we need a test statistic. To test \(H_0: \mu = \mu_0\), it seems reasonable to use a \(z\) test, so we can define the \(z\) statistic after group \(k\) as \[ Z_k^* = \frac{\bar X^{(k)} - \mu_0}{\sigma} \sqrt{\sum_{j=1}^k n_j}, \] which would be standard normal when the data is normally distributed with known variance \(\sigma^2\).
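As a small illustration, here is a Python sketch (with placeholder group sizes and data) that computes the cumulative means and the statistics \(Z_k^*\) group by group:

```python
import numpy as np

def stage_statistics(groups, mu0, sigma):
    """Compute Z*_k after each group, given a list of 1-d arrays of data."""
    stats_out = []
    total_sum, total_n = 0.0, 0
    for g in groups:
        total_sum += g.sum()
        total_n += len(g)
        cum_mean = total_sum / total_n                      # cumulative mean
        z_star = (cum_mean - mu0) / sigma * np.sqrt(total_n)
        stats_out.append(z_star)
    return stats_out

rng = np.random.default_rng(3)
groups = [rng.normal(0.3, 1.0, n_k) for n_k in (20, 25, 30)]  # unequal sizes
print(stage_statistics(groups, mu0=0.0, sigma=1.0))
```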

We could then define a rejection region for each test statistic \(Z_k^*\). We would collect group 1, run the test for \(Z_1^*\), and either stop (if significant) or continue (if not); if we continue, we’d collect group 2, run the test for \(Z_2^*\), and again either stop or continue. The process would repeat until group \(K\).

However, defining the rejection region is not easy. The statistics \(Z_1^*, \dots, Z_K^*\) are dependent, as they involve means of overlapping sets of random variables. The family-wise error rate, as defined in Section 6.6.3, is \[ \Pr(\cup_{k=1}^K \text{reject $H_0$ at stage $k$} \mid H_0), \] i.e. the probability that we reject at least one of the nulls when the null is true. In principle we can still calculate this probability, as we can calculate the covariance between \(Z_i^*\) and \(Z_j^*\) and then work out the probabilities of each rejection; but even in the case where the statistics are jointly multivariate normal, this begins to involve unpleasant integrals. It is difficult to work backwards and set the rejection region to attain a desired error rate \(\alpha\).

10.3.2 Equally-sized groups

The problem becomes simpler—at least in terms of notation and algebra—when we define the groups to have equal size. This seems perfectly reasonable, and any logistical constraints limiting you to a particular size are likely to apply to every group.

When the groups are equally sized with \(n = n_1 = \dots = n_K\), we can simplify the test statistic: \[ Z_k^* = \frac{1}{\sqrt{k}} \sum_{j=1}^k Z_j, \] where \[ Z_j = \frac{\bar X_j - \mu_0}{\sigma} \sqrt{n} \] is the test statistic for group \(j\) alone. From this, one can derive that for \(j \leq k\), \[ \cov(Z_j^*, Z_k^*) = \sqrt{\frac{j}{k}}, \] permitting easier calculation of the critical values.
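We can check the covariance identity by simulation. Under the null, the group statistics \(Z_1, \dots, Z_K\) are independent standard normals, so we can simulate them directly (a Python sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
K, n_sims = 5, 100_000

# Under H0 the group statistics Z_1, ..., Z_K are iid standard normal
z = rng.standard_normal((n_sims, K))
# Z*_k = (Z_1 + ... + Z_k) / sqrt(k), computed for every k at once
z_star = np.cumsum(z, axis=1) / np.sqrt(np.arange(1, K + 1))

j, k = 2, 5
print(np.cov(z_star[:, j - 1], z_star[:, k - 1])[0, 1])  # approx. 0.632
print(np.sqrt(j / k))                                    # sqrt(2/5) = 0.632
```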

There are two common approaches:

  • Set a critical value that is the same for every \(Z_k^*\) (a constant boundary, as in Pocock’s approach), adjusted appropriately for the correlation and multiplicity of the tests.
  • Set critical values that decrease in proportion to \(1/\sqrt{k}\) (as in O’Brien and Fleming’s approach), again chosen to account for the correlation and multiplicity.

In either case, as the number of groups \(K\) increases, the tests become more stringent (less powerful). See Wassmer and Brannath (2016), section 2.1, for details on exactly how to set the critical values.
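As an illustration of the first approach, we can find the constant critical value numerically: choose \(c\) so that, under \(H_0\), the probability that any \(|Z_k^*|\) exceeds \(c\) is \(\alpha\). The Monte Carlo sketch below does this by simulation; exact values are tabulated in Wassmer and Brannath (2016).

```python
import numpy as np

rng = np.random.default_rng(5)
K, alpha, n_sims = 5, 0.05, 200_000

z = rng.standard_normal((n_sims, K))
z_star = np.cumsum(z, axis=1) / np.sqrt(np.arange(1, K + 1))
max_abs = np.abs(z_star).max(axis=1)   # largest |Z*_k| over all stages

# Constant boundary: the (1 - alpha) quantile of max_k |Z*_k| under H0
c = np.quantile(max_abs, 1 - alpha)
print(c)   # approx. 2.41 for K = 5, noticeably larger than 1.96
```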

10.3.3 Alpha-spending

TODO Wassmer and Brannath (2016), section 3.3

10.3.4 Stopping for futility

It is also possible to design a sequential experiment that can be stopped early for futility. If we are testing a one-sided hypothesis, such as “the treatment is better than the control”, we may want to stop the experiment if the treatment does spectacularly well in the first few groups, but we may also want to stop if it does poorly in the first few groups, as this would suggest we will not find evidence in favor of our alternative. Collecting further data would hence be futile.

This is again widely used in clinical trials, as we’d like to stop early if the treatment is not working or is even actively harming the patients.

It is possible to define futility stopping rules and adjust the test critical regions accordingly, although again, the math is tedious because of the correlation between the test statistics for each group.
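As a rough illustration of the structure of such a rule, the sketch below applies an efficacy boundary and a futility boundary to the stagewise statistics for a one-sided test. The boundaries here are arbitrary placeholders, not calibrated to any particular \(\alpha\); in practice the two sets of boundaries must be chosen jointly to control the error rates, as described in Wassmer and Brannath (2016).

```python
import numpy as np

def group_sequential_with_futility(z_stars, efficacy, futility):
    """Apply efficacy/futility boundaries to the stagewise statistics Z*_k.

    efficacy[k] and futility[k] are the upper and lower boundaries at stage k.
    """
    for k, z in enumerate(z_stars):
        if z >= efficacy[k]:
            return "reject H0 (efficacy)", k + 1
        if z <= futility[k]:
            return "stop for futility", k + 1
    return "fail to reject H0", len(z_stars)

# Hypothetical, uncalibrated boundaries for K = 4 stages of a one-sided test;
# the boundaries meet at the final stage so a decision is always reached.
efficacy = [3.0, 2.7, 2.4, 2.1]
futility = [-1.0, -0.5, 0.0, 2.1]

rng = np.random.default_rng(6)
z = rng.standard_normal(4)                     # group statistics under H0
z_star = np.cumsum(z) / np.sqrt(np.arange(1, 5))
print(group_sequential_with_futility(z_star, efficacy, futility))
```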