9  Response Surfaces

So far, we have considered experiments where the goal is to estimate treatment means or mean differences. But as we discussed in Chapter 3, that is not the only possible goal for an experiment.

Suppose we have \(p\) treatment factors, each of which is a continuous variable. In principle, the value of the mean response at each possible combination of factors forms a surface in \((p + 1)\)-dimensional space. We can call this a response surface.

We can express the possible goals of an experiment in terms of the shape of that surface. So far, we have considered experiments to find the difference in response between specific points on that surface. But what if we want to learn the shape of the surface more generally, for instance so we can find the maximum?

We can easily imagine situations where finding a maximum is desirable:

  1. The treatment factors are parameters for some chemical production process—input concentrations, temperature, pressure, and so on. The response is the efficiency of the process. We’d like to find the most efficient way to run the chemical plant.
  2. The treatment factors are parameters for YouTube’s video recommendation algorithm. The response is the “engagement” of users, measured by how many videos they watch, how long they watch, how many ads they click on, or some other metric that product managers decide is interesting. We’d like to make people spend the maximum possible amount of their lives glued to their phones.
  3. The treatment factors are doses of various medications. The response is the fraction of tumor cells (in a test tube) that die when exposed to the medications. We’d like to find a combination of cancer drugs that causes the strongest effect on tumors.

You can undoubtedly think of many other examples where we want to optimize something by controlling various factors.

The simplest way to achieve these goals is to try lots and lots of combinations of the factors, then pick the best one according to the estimated treatment means. But these are continuous variables, so the space of possible combinations is large, and making a grid across the entire space might entail testing millions of combinations. How can we do better?

We might start by treating this as a numerical optimization problem. There is some unknown function \(f(x_1, x_2, \dots, x_p)\) that produces the response as a function of the \(p\) factors. We do not know \(f\), so we cannot solve for its maximum. We could proceed iteratively instead: start with a guess, then figure out what changes would increase the response, repeating the process until the response has increased to a local maximum.

9.1 Optimization in general

You’re probably already familiar with a few numerical optimization techniques. Let’s consider a couple examples.

First, there are first-order optimization methods. “First-order” means they use the function’s value and its derivative at a point, but no other information about it. Perhaps the most famous first-order method is gradient ascent.

Definition 9.1 (Gradient ascent) Consider maximizing a function \(f : \R^p \to \R\). Start with an initial guess \(x^{(0)} \in \R^p\). Then, for steps \(k = 0, 1, 2, \dots\) until convergence, update the guess with \[ x^{(k + 1)} = x^{(k)} + \gamma \nabla f(x^{(k)}), \] where \(\gamma > 0\) is a step size and \(\nabla f(x^{(k)})\) is the gradient, the vector of partial derivatives of \(f\) evaluated at \(x^{(k)}\).

If the step size \(\gamma\) is chosen well, this procedure will converge to a (possibly local) maximum of \(f\).

We can interpret gradient ascent as a simple procedure: Guess \(x\), find the slope of \(f\) at \(x\), then move \(x\) in the direction of steepest ascent. Repeat until the slope is basically flat.
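To make this concrete, here is a minimal sketch of gradient ascent in Python; the toy objective function, step size, and tolerance are all invented for illustration.

```python
import numpy as np

def gradient_ascent(grad, x0, step=0.1, tol=1e-6, max_iter=1000):
    """Maximize a function given its gradient, following Definition 9.1."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)                  # slope of f at the current guess
        if np.linalg.norm(g) < tol:  # stop once the slope is essentially flat
            break
        x = x + step * g             # move in the direction of steepest ascent
    return x

# Toy example: f(x1, x2) = -(x1 - 1)^2 - 2 (x2 + 0.5)^2, maximized at (1, -0.5).
grad_f = lambda x: np.array([-2 * (x[0] - 1), -4 * (x[1] + 0.5)])
print(gradient_ascent(grad_f, x0=[0.0, 0.0]))  # approximately [ 1.  -0.5]
```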

First-order methods are widely used—you’re probably familiar with gradient descent, which is just gradient ascent but for minimization. It is spectacularly common for training neural networks, and indeed is so popular that you might forget there are any other ways to do it.

But there are second-order optimization methods as well. A common one is Newton’s method.

Definition 9.2 (Newton’s method for optimization) Consider maximizing a function \(f : \R^p \to \R\). Start with an initial guess \(x^{(0)} \in \R^p\). Then, for steps \(k = 0, 1, 2, \dots\) until convergence, update the guess with \[ x^{(k + 1)} = x^{(k)} - \left( \nabla^2 f(x^{(k)}) \right)^{-1} \nabla f(x^{(k)}), \] where \(\nabla^2 f(x^{(k)})\) is the matrix of second partial derivatives of \(f\) evaluated at \(x^{(k)}\).

Newton’s method amounts to approximating \(f\) with a quadratic, defined according to the first and second derivatives of \(f\), and then updating our guess to the maximum of that quadratic. No step size parameter is required.
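A corresponding sketch, reusing the toy objective from the previous example (and solving the Newton system rather than explicitly inverting the Hessian):

```python
import numpy as np

def newton_maximize(grad, hess, x0, tol=1e-8, max_iter=100):
    """Newton's method for optimization (Definition 9.2); no step size is needed."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess(x), g)  # Newton update
    return x

# For the toy quadratic above, Newton's method lands on the maximum in one step.
grad_f = lambda x: np.array([-2 * (x[0] - 1), -4 * (x[1] + 0.5)])
hess_f = lambda x: np.array([[-2.0, 0.0], [0.0, -4.0]])
print(newton_maximize(grad_f, hess_f, x0=[0.0, 0.0]))  # [ 1.  -0.5]
```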

In general, optimization methods like these work well when \(f\) is well-behaved: when it is differentiable and, ideally, concave (or convex, if we are minimizing). If \(f\) has numerous local maxima, no numerical optimization method can guarantee that, regardless of the starting point \(x^{(0)}\), it will find the global maximum rather than merely a local one.

9.2 First-order response surface designs

First-order response surface designs apply the basic idea of gradient ascent to a function \(f\) whose value \(f(x)\) and gradient \(\nabla f(x)\) can only be approximated by running an experiment. Suppose our response variable \(Y\) is related to the treatment factors \(X\) by \(Y = f(X) + \epsilon\), and we would like to find \[ x^* = \argmax_x f(x). \] The iterative optimization approach suggests we should zoom in on the surface \(f(x)\). Rather than trying to learn the entire shape of \(f\) from a single experiment, such as by doing a huge factorial experiment over the entire range of every treatment factor, we pick a starting value \(x^{(0)}\). We then run a small local experiment, varying \(x\) slightly around \(x^{(0)}\), to estimate \(\nabla f(x^{(0)})\).

[TODO diagram of a 3D response surface, showing us zooming in and fitting a plane to a small part]

After we estimate \(\nabla f(x^{(0)})\), we apply the update rule in Definition 9.1, then repeat for each \(x^{(k)}\). Estimating \(\nabla f(x^{(k)})\) is the essential part: the gradient tells us how to change \(x^{(k)}\) to optimize the function. Hence we want a design that

  1. Gives us the best possible estimate of \(\nabla f(x^{(k)})\) given a fixed number of experimental units
  2. Lets us estimate \(\hat \sigma^2\), so we can tell if the linear approximation fits well
  3. Is, ideally, orthogonal, so the slope estimates have the smallest possible variance and are uncorrelated with each other.

When we analyze the data from such an experiment, our model need only produce an estimate of \(\nabla f(x)\); the model can be as simple as possible as long as it can give us the gradient. A natural choice is the first-order Taylor approximation of \(f\) around the current point \(x^{(k)}\): \[ f(x) \approx f(x^{(k)}) + \nabla f(x^{(k)})\T (x - x^{(k)}). \] This approximates \(f(x)\) as a linear function of \(x\), so a simple linear model suffices.

Consider a situation where \(p = 2\), so we can draw the design space in 2D:

[TODO diagram]

Objective 1 suggests we should use a design with four combinations of the two treatment factors. [TODO draw]. Objective 2 suggests we need more than one replicate at one or more design points, so we can estimate \(\hat \sigma^2\); but putting a replicate at only one design point makes the design not orthogonal, while putting replicates at every design point means doubling or tripling the sample size.

So we add a center point. [TODO draw]. The center point has \(n_0\) replicates, typically somewhere between 3 and 6. This design is now called a “standard first-order design.” With the center point, the design matrix can still be written in orthogonal form. If we rescale the variables so their largest and smallest values are \(\pm 1\) and the center point is at 0, the design matrix is \[ X = \begin{pmatrix} 1 & -1 & -1\\ 1 & -1 & +1\\ 1 & +1 & -1 \\ 1 & +1 & +1 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & \vdots & \vdots \end{pmatrix}. \] We can see that the two factor columns are indeed orthogonal. And if we fit the linear model \(Y = f(X) + \epsilon = X \beta + \epsilon\), where \(\beta = (\beta_0, \beta_1, \beta_2)\), then \[ \nabla f(X) = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}, \] so we may trivially estimate the gradient using \(\hat \beta\). We then plug the estimate into Definition 9.1 to find the next center point and repeat the procedure.
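As a sketch of one iteration in Python (the true surface, error scale, and step size \(\gamma\) here are hypothetical, chosen only to illustrate the bookkeeping):

```python
import numpy as np

# Standard first-order design for p = 2, in coded units:
# a 2^2 factorial plus n0 = 4 center points.
factorial = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
points = np.vstack([factorial, np.zeros((4, 2))])
X = np.column_stack([np.ones(len(points)), points])  # columns: intercept, x1, x2

# Hypothetical responses from a local experiment around the current center point.
rng = np.random.default_rng(1)
y = 10 + 2 * points[:, 0] - 1 * points[:, 1] + rng.normal(scale=0.5, size=len(points))

# Fit Y = X beta + eps; (beta1, beta2) estimates the gradient at the center.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
grad_hat = beta[1:]

# One steepest-ascent step (Definition 9.1) gives the next center point, in coded units.
gamma = 0.5
next_center = gamma * grad_hat
print(grad_hat, next_center)
```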

The general intuition is that when \(x^{(k)}\) is far from the maximum, a linear model is probably a good approximation of \(f\). It at least tells us the direction of steepest ascent. As \(x^{(k)}\) gets close to the maximum, the curvature may become more important—if we imagine a nice concave function, it looks fairly quadratic near the maximum, not linear.1

This suggests we need a way of testing whether a first-order approximation is sufficient, or whether a second-order approximation is necessary. Your first guess may be a partial \(F\) test: fit a linear model and then a model with quadratic terms, and test the null hypothesis that the quadratic coefficients are zero.

But unfortunately you will find that the pure quadratic terms are aliased in this design: the \(X_1^2\) and \(X_2^2\) columns are identical (1 at every factorial point, 0 at every center point), so they can’t be separately estimated, and you can’t do the obvious \(F\) test.
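You can see the aliasing directly from the design points; a quick check in Python (a sketch with my own variable names) shows the two pure quadratic columns coincide exactly.

```python
import numpy as np

factorial = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
points = np.vstack([factorial, np.zeros((4, 2))])  # factorial points plus center points
x1_sq = points[:, 0] ** 2
x2_sq = points[:, 1] ** 2
print(x1_sq)                      # [1. 1. 1. 1. 0. 0. 0. 0.]
print(np.allclose(x1_sq, x2_sq))  # True: the X1^2 and X2^2 columns are aliased
```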

TODO

9.3 Second-order response surface designs

Second-order response surface designs try to apply Newton’s method to optimize a function \(f\) whose gradient \(\nabla f(x)\) and Hessian \(\nabla^2 f(x)\) can only be approximated by experiment.

We will need the Hessian, so following the same Taylor expansion argument, we will need to include the second-order term (hence the name): \[ f(x) \approx f(x^{(k)}) + \nabla f(x^{(k)})\T (x - x^{(k)}) + \frac{1}{2} (x - x^{(k)})\T \nabla^2 f(x^{(k)}) (x - x^{(k)}). \] This approximates \(f(x)\) as a quadratic function of \(x\), so we will fit a model with quadratic terms. For instance, if there are two predictors, \(X_1\) and \(X_2\), we will fit the model \[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^2 + \beta_4 X_2^2 + \beta_5 X_1 X_2 + \epsilon. \] If we fit this model and take its second derivatives to estimate \(\nabla^2 f(x)\), we will find the estimated Hessian depends on \(\beta_3\), \(\beta_4\), and \(\beta_5\) (see Exercise 9.1), so it is crucial that we estimate these well.
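Once \(\hat\beta\) is available from fitting this model by least squares (on a design that supports it, such as the central composite design introduced below), the estimated gradient, the estimated Hessian, and the Newton update of Definition 9.2 follow directly. A sketch, with function names of my own:

```python
import numpy as np

def newton_step_from_quadratic_fit(beta, x):
    """Gradient, Hessian, and Newton update implied by the fitted quadratic model
    Y = b0 + b1 X1 + b2 X2 + b3 X1^2 + b4 X2^2 + b5 X1 X2."""
    b0, b1, b2, b3, b4, b5 = beta
    grad = np.array([b1 + 2 * b3 * x[0] + b5 * x[1],
                     b2 + 2 * b4 * x[1] + b5 * x[0]])
    hess = np.array([[2 * b3, b5],
                     [b5, 2 * b4]])        # depends only on b3, b4, b5 (Exercise 9.1)
    return x - np.linalg.solve(hess, grad)  # Definition 9.2
```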

The standard first-order designs of Section 9.2 will not work here, as they do not allow us to fit quadratic terms. We will need additional design points.

We hence want a design that

  1. Gives us the best possible estimate of both \(\nabla f(x)\) and \(\nabla^2 f(x)\)
  2. Lets us estimate \(\hat \sigma^2\), so we can tell if the quadratic approximation fits well
  3. Is, ideally, orthogonal, so the slope and Hessian estimates have the smallest possible variance and are uncorrelated with each other.

The typical solution to these requirements is the central composite design. A central composite design starts with a standard first-order design with \(n_0\) center points and \(n_f\) factorial points. It then adds \(n_a\) axial points, placed at \(\pm \alpha\) on each treatment factor’s axis with every other factor held at its center value:

[TODO diagram]

As before, we typically choose \(n_0\) between 3 and 6 to ensure \(\hat \sigma^2\) can be estimated. Choosing \(\alpha\) is slightly more difficult, and depends on the specific design properties we want. There are two properties that can be desirable.

9.3.1 Rotatability

If we estimate \(\hat f(X)\) as a quadratic function, we can estimate \(\var(\hat Y) = \var(\hat f(X))\), the variance of the predicted response. If all treatment factors are on the same scale, it seems reasonable to ask that this variance is the same for any design points an equal distance from the design center (the origin). That is, \(\var(\hat f((1, 0, 0))) = \var(\hat f((0, 1, 0))) = \var(\hat f((0, 0, 1)))\), and so on. This ensures we have equal information about the response surface in each direction, rather than favoring one treatment factor over the others. That’s reasonable enough, as we don’t know which direction the maximum will be in, so we’d like to estimate \(f\) equally well in all directions.

A design satisfying this property is called rotatable.

Theorem 9.1 (Rotatable central composite designs) A central composite design is rotatable if \(\alpha = n_f^{1/4}\), where \(n_f\) is the number of factorial points.

Proof. See Box and Hunter (1957).
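As a rough numerical check for \(p = 2\) (a sketch; the helper functions are my own), we can build a central composite design with \(\alpha = n_f^{1/4}\) and confirm that \(x_0\T (X\T X)^{-1} x_0\), which is proportional to \(\var(\hat f(x_0))\), is the same at several points one unit from the center:

```python
import numpy as np
from itertools import product

def central_composite(p, n0):
    """Central composite design in coded units, with the rotatable alpha = n_f^(1/4)."""
    factorial = np.array(list(product([-1.0, 1.0], repeat=p)))  # 2^p factorial points
    alpha = len(factorial) ** 0.25                              # Theorem 9.1
    axial = np.vstack([alpha * np.eye(p), -alpha * np.eye(p)])  # 2p axial points
    return np.vstack([factorial, axial, np.zeros((n0, p))])     # plus n0 center points

def quad_model_matrix(points):
    """Model matrix for the two-factor quadratic model of Section 9.3."""
    x1, x2 = points[:, 0], points[:, 1]
    return np.column_stack([np.ones(len(points)), x1, x2, x1**2, x2**2, x1 * x2])

design = central_composite(p=2, n0=4)
X = quad_model_matrix(design)
XtX_inv = np.linalg.inv(X.T @ X)
for point in ([1.0, 0.0], [0.0, 1.0], [np.sqrt(0.5), np.sqrt(0.5)]):
    x0 = quad_model_matrix(np.array([point]))[0]
    print(point, x0 @ XtX_inv @ x0)  # equal values at all three unit-distance points
```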

9.3.2 Orthogonality

We can also ask for the design to be orthogonal, i.e. the columns in the design matrix are all orthogonal to each other. But here it is less obvious why this is desirable. Why should the design matrix columns for \(X_1\) and \(X_1^2\) be orthogonal to each other? Ensuring they are both orthogonal to the \(X_2\) and \(X_2^2\) columns is reasonable, so the effect estimates are uncorrelated, but requiring all pairs of columns to be orthogonal is less obviously useful.

Nonetheless, it’s convenient if you are conducting tests for specific coefficients, for instance if you want to test if a quadratic relationship is truly necessary for one specific treatment factor. It also simplifies computation, of course, but this is not a concern if we’re using a computer to fit our models.

The requirements on \(\alpha\) and \(n_0\) for orthogonality are more complicated than for rotatability; see Dean, Voss, and Draguljić (2017), section 16.4.2. It may not always be possible for a design to be both orthogonal and rotatable.

9.4 The response-surface process

TODO

9.5 Exercises

Exercise 9.1 (Hessian of the polynomial model) Consider fitting the polynomial model \[ Y = f(X) + \epsilon = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^2 + \beta_4 X_2^2 + \beta_5 X_1 X_2 + \epsilon. \] Calculate \(\nabla^2 f(x)\). Show that it depends only on \(\beta_3\), \(\beta_4\), and \(\beta_5\).


  1. This intuition appears to be common in response surface designs, but I’m not sure it’s justified by an appeal to functions having particular smoothness and curvature properties. Perhaps experimental design theorists just expect that in nature, response surfaces will be nice well-behaved functions, and that no physical response surface could look like the Weierstrass function.↩︎