9 Response Surfaces
\[ \DeclareMathOperator{\E}{\mathbb{E}} \DeclareMathOperator{\R}{\mathbb{R}} \DeclareMathOperator{\RSS}{RSS} \DeclareMathOperator{\var}{Var} \DeclareMathOperator{\cov}{Cov} \DeclareMathOperator{\trace}{trace} \newcommand{\ind}{\mathbb{I}} \newcommand{\T}{^\mathsf{T}} \newcommand{\X}{\mathbf{X}} \newcommand{\Y}{\mathbf{Y}} \newcommand{\tsp}{\bar \tau_\text{sp}} \newcommand{\tsphat}{\hat \tau_\text{sp}} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max} \]
So far, we have considered experiments where the goal is to estimate treatment means or mean differences. But as we discussed in Chapter 3, that is not the only possible goal for an experiment.
Suppose we have \(p\) treatment factors, each of which is a continuous variable. In principle, the value of the mean response at each possible combination of factors forms a surface in \((p + 1)\)-dimensional space. We can call this a response surface.
We can express the possible goals of an experiment in terms of the shape of that surface. So far, we have considered experiments to find the difference in response between specific points on that surface. But what if we want to learn the shape of the surface more generally, for instance so we can find the maximum?
We can easily imagine situations where finding a maximum is desirable:
- The treatment factors are parameters for some chemical production process—input concentrations, temperature, pressure, and so on. The response is the efficiency of the process. We’d like to find the most efficient way to run the chemical plant.
- The treatment factors are parameters for YouTube’s video recommendation algorithm. The response is the “engagement” of users, measured by how many videos they watch, how long they watch, how many ads they click on, or some other metric that product managers decide is interesting. We’d like to make people spend the maximum possible amount of their lives glued to their phones.
- The treatment factors are doses of various medications. The response is the fraction of tumor cells (in a test tube) that die when exposed to the medications. We’d like to find the combination of cancer drugs that causes the strongest effect on tumors.
You can undoubtedly think of many other examples where we want to optimize something by controlling various factors.
The simplest way to achieve these goals is to try lots and lots of combinations of the factors, then pick the best one according to the estimated treatment means. But these are continuous variables, so the space of possible combinations is large, and making a grid across the entire space might entail testing millions of combinations. How can we do better?
We might start by treating this as a numerical optimization problem. There is some unknown function \(f(x_1, x_2, \dots, x_p)\) that produces the response as a function of the \(p\) factors. We do not know \(f\), so we cannot solve for its maximum directly. But if we believe that \(f\) is reasonably smooth, we can learn something about its shape by evaluating it at well-chosen points. Rather than evaluating it on a huge grid, we can proceed iteratively: start with a guess, figure out what changes would increase the response, and repeat until the response has climbed to a local maximum.
9.1 Optimization in general
You’re probably already familiar with a few numerical optimization techniques. Let’s consider a couple examples.
First, there are first-order optimization methods. “First-order” means they use the function’s value and its derivative at a point, but no other information about it. Perhaps the most famous first-order method is gradient ascent.
Definition 9.1 (Gradient ascent) Consider maximizing a function \(f : \R^p \to \R\). Start with an initial guess \(x^{(0)} \in \R^p\). Then, for steps \(k = 0, 1, 2, \dots\) until convergence, update the guess with \[ x^{(k + 1)} = x^{(k)} + \gamma \nabla f(x^{(k)}), \] where \(\gamma > 0\) is a step size and \(\nabla f(x^{(k)})\) is the gradient, the vector of partial derivatives of \(f\) evaluated at \(x^{(k)}\).
This procedure is called gradient ascent. If the step size \(\gamma\) is chosen well, it will converge to a (possibly local) maximum of \(f\).
We can interpret gradient ascent as a simple procedure: Guess \(x\), find the slope of \(f\) at \(x\), then move \(x\) in the direction of steepest ascent. Repeat until the slope is basically flat.
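To make the update rule concrete, here is a minimal R sketch of gradient ascent on a known toy function; the objective, step size, and stopping tolerance are all illustrative choices rather than part of any standard recipe.

```r
# Gradient ascent on f(x1, x2) = -(x1 - 1)^2 - 2 * (x2 + 0.5)^2,
# whose maximum is at (1, -0.5).
grad_f <- function(x) c(-2 * (x[1] - 1), -4 * (x[2] + 0.5))

x     <- c(0, 0)   # initial guess x^(0)
gamma <- 0.1       # step size
for (k in 1:1000) {
  g <- grad_f(x)
  x <- x + gamma * g                # the update rule in Definition 9.1
  if (sqrt(sum(g^2)) < 1e-8) break  # stop when the gradient is nearly flat
}
x  # approximately (1, -0.5)
```

Of course, in an experiment we cannot call grad_f directly; the whole problem is that the gradient must itself be estimated from noisy observations, which is where experimental design enters.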
First-order methods are widely used—you’re probably familiar with gradient descent, which is just gradient ascent but for minimization. It is spectacularly common for training neural networks, and indeed is so popular that you might forget there are any other ways to do it.
But there are second-order optimization methods as well. A common one is Newton’s method.
Definition 9.2 (Newton’s method for optimization) Consider maximizing a function \(f : \R^p \to \R\). Start with an initial guess \(x^{(0)} \in \R^p\). Then, for steps \(k = 0, 1, 2, \dots\) until convergence, update the guess with \[ x^{(k + 1)} = x^{(k)} - \left( \nabla^2 f(x^{(k)}) \right)^{-1} \nabla f(x^{(k)}), \] where \(\nabla^2 f(x^{(k)})\) is the Hessian, the matrix of second partial derivatives of \(f\) evaluated at \(x^{(k)}\).
Newton’s method amounts to approximating \(f\) with a quadratic, defined by the first and second derivatives of \(f\) at the current guess, and then jumping to the stationary point of that quadratic (its maximum, when the Hessian is negative definite). No step size parameter is required.
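For comparison, here is a minimal sketch of Newton's method on the same toy objective as before; because that objective is exactly quadratic, the method lands on the maximum in a single step.

```r
# Newton's method on f(x1, x2) = -(x1 - 1)^2 - 2 * (x2 + 0.5)^2.
grad_f <- function(x) c(-2 * (x[1] - 1), -4 * (x[2] + 0.5))
hess_f <- function(x) diag(c(-2, -4))  # constant Hessian, since f is quadratic

x <- c(0, 0)
for (k in 1:100) {
  g <- grad_f(x)
  if (sqrt(sum(g^2)) < 1e-10) break
  x <- x - solve(hess_f(x), g)   # the update rule in Definition 9.2
}
x  # (1, -0.5) after a single step
```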
In general, optimization methods like these work well when \(f\) is well-behaved: when it is differentiable and, ideally, concave, so that any local maximum is the global maximum. If \(f\) has numerous local maxima, no numerical optimization method can guarantee that, regardless of the starting point \(x^{(0)}\), it will find the global maximum rather than merely a local one.
9.2 First-order response surface designs
First-order response surface designs apply the basic idea of gradient ascent to a function \(f\) whose value \(f(x)\) and gradient \(\nabla f(x)\) can only be estimated by running an experiment. Suppose our response variable \(Y\) is related to the treatment factors \(X\) by \(Y = f(X) + \epsilon\), and we would like to find \[ x^* = \argmax_x f(x). \] The iterative optimization approach suggests we should zoom in on the surface \(f(x)\). Rather than trying to learn the entire shape of \(f\) from a single experiment, such as by doing a huge factorial experiment over the entire range of every treatment factor, we pick a starting value \(x^{(0)}\) and estimate \(\nabla f(x^{(0)})\) by running a small local experiment that varies \(x\) slightly around \(x^{(0)}\).
[TODO diagram of a 3D response surface, showing us zooming in and fitting a plane to a small part]
After we estimate \(\nabla f(x^{(0)})\), we apply the update rule in Definition 9.1, then repeat for each \(x^{(k)}\). Estimating \(\nabla f(x^{(k)})\) is the essential part: the gradient tells us how to change \(x^{(k)}\) to optimize the function.
9.2.1 Desirable properties
Given our iterative optimization approach, our design need only produce an estimate of \(\nabla f(x)\); the model can be as simple as possible as long as it gives us the gradient. It follows that we can use the first-order Taylor approximation of \(f\) around the current point \(x^{(k)}\): \[ f(x) \approx f(x^{(k)}) + \nabla f(x^{(k)})\T (x - x^{(k)}). \] This approximates \(f(x)\) as a linear function of \(x\), so a simple linear model will suffice. The slope coefficients \(\beta\) give the gradient.
We can thus determine that an experimental design for response surface modeling should:
1. Give us the best possible estimate of \(\nabla f(x^{(k)})\) given a fixed number of experimental units
2. Or, to make that precise, have the lowest possible variance of the slope estimates \(\hat \beta\)
3. Let us estimate \(\sigma^2\), so we can tell if the linear approximation fits well
We can elaborate the second criterion further. A reasonable objective would be to minimize the average variance of the regression coefficients, \[ \frac{1}{p + 1} \sum_{i=0}^p \var(\hat \beta_i) = \frac{1}{p + 1} \trace(\var(\hat \beta)). \] One can show that when the treatment factors are centered and scaled, this objective is minimized when the columns of the design matrix \(\X\) are orthogonal (Exercise 9.3). This suggests using factorial designs and their variations.
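As a quick numerical check of this claim (a demonstration, not a proof; see Exercise 9.3), we can compare the average coefficient variance for an orthogonal two-factor design against a correlated one. The specific design points below are made up for illustration.

```r
# Average variance of beta-hat is proportional to trace((X'X)^{-1}) / (p + 1).
avg_var <- function(X) {
  X <- cbind(1, scale(X))              # intercept plus centered, scaled factors
  sum(diag(solve(t(X) %*% X))) / ncol(X)
}

orthogonal <- expand.grid(x1 = c(-1, 1), x2 = c(-1, 1))   # 2^2 factorial
correlated <- data.frame(x1 = c(-1, -1, 1, 1), x2 = c(-1, -0.5, 0.5, 1))

avg_var(as.matrix(orthogonal))  # smaller
avg_var(as.matrix(correlated))  # larger
```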
9.2.2 Standard first-order designs
Consider a situation where \(p = 2\), so we can draw the design space in 2D:
[TODO diagram]
Objective 1 suggests we should use a design with four combinations of the two treatment factors, meaning a factorial design (Section 7.1) or fractional factorial design (Section 7.5). [TODO draw]. Objective 3 suggests we need more than one replicate at one or more design points, so we can estimate \(\hat \sigma^2\); but putting a replicate at only one design point makes the design not orthogonal, while putting replicates at every design point means doubling or tripling the sample size.
So we add a center point. [TODO draw]. The center point has \(n_0\) replicates, typically somewhere between 3 and 6. This design is now called a “standard first-order design.” With the center point added, the factor columns of the design matrix remain orthogonal. If we rescale the variables so their largest and smallest values are \(\pm 1\) and the center point is at 0, the design matrix is \[ \X = \begin{pmatrix} 1 & -1 & -1\\ 1 & -1 & +1\\ 1 & +1 & -1 \\ 1 & +1 & +1 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ \vdots & \vdots & \vdots \end{pmatrix}. \] We can see that the two factor columns are indeed orthogonal. And if we fit the linear model \(Y = f(X) + \epsilon = X \beta + \epsilon\), where \(\beta = (\beta_0, \beta_1, \beta_2)\), then \[ \nabla f(X) = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}, \] so we may trivially estimate the gradient using \(\hat \beta\). We then plug the estimate into Definition 9.1 to find the next center point and repeat the procedure.
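Here is a sketch of one iteration of this procedure in R, using a \(2^2\) factorial plus four center points in coded units; the responses are simulated from a hypothetical surface purely for illustration, and the step-size rule is one common heuristic rather than a requirement.

```r
set.seed(1)

# Standard first-order design in coded units: 2^2 factorial plus 4 center points.
design <- rbind(expand.grid(x1 = c(-1, 1), x2 = c(-1, 1)),
                data.frame(x1 = rep(0, 4), x2 = rep(0, 4)))

# Simulated responses from a hypothetical true surface (unknown in practice).
design$y <- 50 + 3 * design$x1 + 5 * design$x2 + rnorm(nrow(design), sd = 1)

fit      <- lm(y ~ x1 + x2, data = design)
grad_hat <- coef(fit)[c("x1", "x2")]   # estimated gradient at the current center

# One gradient-ascent step (Definition 9.1); the step size gamma is a judgment
# call, often set so the largest coordinate moves by about one coded unit.
gamma       <- 1 / max(abs(grad_hat))
next_center <- c(x1 = 0, x2 = 0) + gamma * grad_hat
next_center
```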
A full factorial first-order design has \(2^p + 1\) design points, where \(p\) is the number of treatment factors. With \(n_0\) replicates at the center point, this requires \(2^p + n_0\) observations. We can reduce this by using fractional factorial designs (Section 7.5), provided the main effects are not aliased with each other.
9.2.3 Testing for lack of fit
The general intuition is that when \(x^{(k)}\) is far from the maximum, a linear model is probably a good approximation of \(f\). It at least tells us the direction of steepest ascent. As \(x^{(k)}\) gets close to the maximum, the curvature may become more important—if we imagine a nice concave function, it looks fairly quadratic near the maximum, not linear.¹
This suggests we need a way of testing whether a first-order approximation is sufficient, or whether a second-order approximation is necessary. Your first guess may be a partial \(F\) test: fit a linear model and then a model with quadratic terms, and test the null hypothesis that the quadratic coefficients are zero.
But unfortunately you will find that the quadratic terms are aliased in this design. They can’t be separately estimated, so you can’t do the obvious \(F\) test. The problem is simply the number of parameters. For example, when \(p = 2\), a full quadratic model has 6 parameters (intercept, two linear terms, two quadratic terms, and the interaction); but the number of unique design points is \(2^p + 1 = 5\), so the design matrix has only 5 unique rows, regardless of the number of center point replicates. The design matrix is not of full rank, so there is not a unique solution to the least-squares problem. Or, in other words, the design matrix is necessarily collinear.
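We can check this rank deficiency directly by building the model matrix for the full quadratic model on a \(p = 2\) standard first-order design:

```r
# 2^2 factorial plus center points, in coded units.
design <- rbind(expand.grid(x1 = c(-1, 1), x2 = c(-1, 1)),
                data.frame(x1 = rep(0, 4), x2 = rep(0, 4)))

# Model matrix for the full quadratic model: 6 columns, but only rank 5,
# because the two pure quadratic columns are identical at these design points.
# The rank stays 5 no matter how many center replicates we add.
X <- model.matrix(~ x1 + x2 + I(x1^2) + I(x2^2) + x1:x2, data = design)
ncol(X)      # 6 parameters
qr(X)$rank   # 5
```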
However, it is still possible to obtain \(\hat Y\), the vector of fitted values, in a linear model with a collinear design matrix. (See the 707 lecture notes, section 9.5.) We can hence still calculate the residual sum of squares (RSS) for the quadratic model and compare it to the RSS for the linear model using a partial \(F\) test. However, the degrees of freedom for the partial \(F\) test’s null distribution are tedious to work out when the model is not of full rank, so the test is easier to understand when it is derived a different way.
A partial \(F\) test takes the ratio of the change in RSS due to using the larger model to the RSS of the larger model, each scaled by its degrees of freedom. Here, the RSS of the larger model is called the sum of squares for pure error. The larger model fits the factorial points perfectly, so only at the center point (where there are replicates) is there any error. Hence we calculate \[ \text{SS}_\text{PE} = \sum_{\text{$i$ at center point}} (Y_i - \bar Y_\text{center})^2, \] the sum of squares for the observations at the center point. This has \(n_0 - 1\) degrees of freedom. The residual sum of squares of the first-order model, \(\RSS_\text{FO}\), has \(n - p - 1\) degrees of freedom, where \(n = n_d + n_0 - 1\) and \(n_d\) is the number of unique design points. The \(F\) statistic is therefore \[ F = \frac{(\RSS_\text{FO} - \text{SS}_\text{PE}) / (n_d - p - 1)}{\text{SS}_\text{PE} / (n_0 - 1)}. \] Under the null hypothesis this follows the \(F_{n_d - p - 1, n_0 - 1}\) distribution. When the first-order model fits poorly, \(\RSS_\text{FO}\) is large, causing the \(F\) statistic to grow and the null to be rejected. Rejecting the null indicates the linear fit is poor and suggests we are near a maximum (or ridge, or saddle point) of the response surface.
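Here is a sketch of the lack-of-fit test in R, using a simulated first-order experiment whose true surface (hypothetical, for illustration) has genuine curvature, so lack of fit should tend to appear.

```r
set.seed(2)
design <- rbind(expand.grid(x1 = c(-1, 1), x2 = c(-1, 1)),
                data.frame(x1 = rep(0, 4), x2 = rep(0, 4)))
# Simulate from a surface with curvature, so the first-order model is wrong.
design$y <- 50 + 3 * design$x1 + 5 * design$x2 - 4 * design$x1^2 -
  4 * design$x2^2 + rnorm(nrow(design), sd = 0.5)

fit_fo <- lm(y ~ x1 + x2, data = design)
rss_fo <- sum(resid(fit_fo)^2)

center <- design$x1 == 0 & design$x2 == 0
ss_pe  <- sum((design$y[center] - mean(design$y[center]))^2)

n_d <- nrow(unique(design[, c("x1", "x2")]))  # unique design points (5)
n_0 <- sum(center)                            # center-point replicates
p   <- 2

F_stat <- ((rss_fo - ss_pe) / (n_d - p - 1)) / (ss_pe / (n_0 - 1))
pf(F_stat, n_d - p - 1, n_0 - 1, lower.tail = FALSE)  # lack-of-fit p-value
```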
9.3 Second-order response surface designs
Second-order response surface designs try to apply Newton’s method to optimize a function \(f\) whose gradient \(\nabla f(x)\) and Hessian \(\nabla^2 f(x)\) can only be approximated by experiment.
We will need the Hessian, so following the same Taylor expansion argument, we will need to include the second-order term (hence the name): \[ f(x) \approx f(x^{(k)}) + \nabla f(x^{(k)})\T (x - x^{(k)}) + \frac{1}{2} (x - x^{(k)})\T \nabla^2 f(x^{(k)}) (x - x^{(k)}). \] This approximates \(f(x)\) as a quadratic function of \(x\), so we will fit a model with quadratic terms. For instance, if there are two predictors, \(X_1\) and \(X_2\), we will fit the model \[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^2 + \beta_4 X_2^2 + \beta_5 X_1 X_2 + \epsilon. \] If we fit this model and take its second derivatives to estimate \(\nabla^2 f(x)\), we will find the estimated Hessian depends only on \(\beta_3\), \(\beta_4\), and \(\beta_5\) (see Exercise 9.1), so it is crucial that we estimate these well.
The standard first-order designs of Section 9.2 will not work here, as they do not allow us to fit quadratic terms. We will need additional design points.
We hence want a design that
- Gives us the best possible estimate of both \(\nabla f(x)\) and \(\nabla^2 f(x)\)
- Lets us estimate \(\sigma^2\), so we can tell if the quadratic approximation fits well
- Is, ideally, orthogonal, so that the slope and Hessian estimates have the smallest possible variance and are uncorrelated.
9.3.1 Central composite designs
The typical solution to these requirements is the central composite design. A central composite design starts with a standard first-order design with \(n_0\) center points and \(n_f\) factorial points. It then adds \(n_a = 2p\) axial points, each placed at \(\pm \alpha\) on one treatment factor’s axis with all other factors held at 0, as illustrated in Figure 9.1.
As before, we typically choose \(n_0\) between 3 and 6 to ensure \(\sigma^2\) can be estimated. Choosing \(\alpha\) is slightly more difficult, and depends on the specific design properties we want. One desirable property could be that the variance of \(\hat f(X)\) is the same in any direction.
Definition 9.3 (Rotatable designs) An experimental design is rotatable if \(\var(\hat f(x))\), the variance of the fitted model’s prediction at a point \(x\), is a function only of \(\|x\|_2\), the distance from the point to the origin.
Hence in a rotatable design the variance is symmetric around the origin. This is a reasonable request: unless one factor is more important than the others, we should choose a design that obtains equal information about the response surface in each direction. We can obtain a rotatable central composite design by carefully choosing \(\alpha\).
Theorem 9.1 (Rotatable central composite designs) A central composite design is rotatable if \(\alpha = n_f^{1/4}\), where \(n_f\) is the number of factorial points.
Proof. See Box and Hunter (1957).
We can also ask for the design to be orthogonal, i.e., the columns in the design matrix are all orthogonal to each other (except for the intercept). But here it is less obvious why this is desirable. Why should the design matrix columns for \(X_1\) and \(X_1^2\) be orthogonal to each other? Ensuring they are orthogonal to the \(X_2\) and \(X_2^2\) columns is reasonable, so the effect estimates are uncorrelated, but requiring all pairs of columns to be orthogonal is less obviously useful.
Nonetheless, it’s convenient if you are conducting tests for specific coefficients, for instance if you want to test if a quadratic relationship is truly necessary for one specific treatment factor. It also simplifies computation, of course, by reducing the multiple regression problem to separate univariate regression problems, but this is not a concern if we’re using a computer to fit our models.
The requirements on \(\alpha\) and \(n_0\) for orthogonality are more complicated than for rotatability; see Dean, Voss, and Draguljić (2017), section 16.4.2. It may not always be possible for a design to be both orthogonal and rotatable.
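As a concrete illustration, here is a rotatable central composite design for \(p = 2\) laid out by hand in coded units; design-generation packages can do this for us, but the manual construction makes the structure clear. The choice of four center points is just one reasonable option.

```r
p     <- 2
n_0   <- 4                 # center-point replicates
alpha <- (2^p)^(1/4)       # rotatable choice from Theorem 9.1: n_f^(1/4)

factorial_pts <- expand.grid(x1 = c(-1, 1), x2 = c(-1, 1))
axial_pts     <- data.frame(x1 = c(-alpha, alpha, 0, 0),
                            x2 = c(0, 0, -alpha, alpha))
center_pts    <- data.frame(x1 = rep(0, n_0), x2 = rep(0, n_0))

ccd_design <- rbind(factorial_pts, axial_pts, center_pts)
ccd_design   # 2^2 + 2 * 2 + n_0 = 12 runs
```

With \(p = 2\), the rotatable choice is \(\alpha = 2^{2/4} = \sqrt{2}\), so the axial points lie on the same circle as the factorial points.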
9.3.2 Fitting the quadratic model
Once we have constructed a central composite design and collected the data, we can fit a second-order polynomial model. Rather than writing out sums of linear, quadratic, and interaction terms, it is easier to consider the model in matrix form. Let \(X \in \R^p\) be a point in the design space. Then a quadratic model for the response can be written as \[ Y = f(X) + \epsilon, \qquad f(X) = \beta_0 + \beta\T X + X\T B X, \] where \(\beta_0\) is the intercept, \(\beta \in \R^p\) is a vector of slope coefficients, and \(B\) is a symmetric \(p \times p\) matrix. The diagonal entries of \(B\) give the coefficients of the quadratic terms, while the off-diagonal entries encode the interactions between pairs of variables: because \(X\T B X\) sums over both \(B_{ij}\) and \(B_{ji}\), the coefficient of \(X_i X_j\) is \(2 B_{ij}\). This model can be fit using least squares in the usual way, as you have likely done before for many models with polynomials and interactions, and the fitted coefficients rearranged into this matrix and vector form.
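In R, for example, we could fit the quadratic model to data from the central composite design above (rebuilt here so the block stands alone) and rearrange the fitted coefficients into \(\hat\beta_0\), \(\hat\beta\), and \(\hat B\); the response is simulated from a hypothetical surface for illustration.

```r
set.seed(3)
alpha  <- sqrt(2)
design <- rbind(expand.grid(x1 = c(-1, 1), x2 = c(-1, 1)),
                data.frame(x1 = c(-alpha, alpha, 0, 0),
                           x2 = c(0, 0, -alpha, alpha)),
                data.frame(x1 = rep(0, 4), x2 = rep(0, 4)))

# Simulated response from a hypothetical true surface (unknown in practice).
design$y <- 60 + 2 * design$x1 + 3 * design$x2 - 2 * design$x1^2 -
  3 * design$x2^2 + design$x1 * design$x2 + rnorm(nrow(design), sd = 0.5)

fit <- lm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + x1:x2, data = design)
b   <- coef(fit)

beta0 <- b["(Intercept)"]
beta  <- b[c("x1", "x2")]
# Each off-diagonal entry holds half the interaction coefficient, so that
# X' B X reproduces the fitted interaction term.
B <- matrix(c(b["I(x1^2)"],   b["x1:x2"] / 2,
              b["x1:x2"] / 2, b["I(x2^2)"]), nrow = 2)
```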
A quadratic function has a stationary point that is either a minimum, maximum, or saddle point. We can find the stationary point \(x_s\) by setting the gradient to zero: \[\begin{align*} 0 &= \nabla f(x_s) \\ &= \beta + 2 B x_s \\ x_s &= - \frac{1}{2} B^{-1} \beta. \end{align*}\] To determine if this is a minimum, maximum, or saddle point, we must examine the second derivatives. When you studied quadratic functions of one variable, you learned to use the sign of the second derivative to check. Here there is an entire matrix of second derivatives, the Hessian: \[ \nabla^2 f(x) = 2 B. \] This is the same Hessian as a quadratic model that omits the linear terms, and the Hessian is the same regardless of \(x\). For simplicity, then, consider the purely quadratic model \[ f^*(X) = X\T B X. \] Calculate the eigenvalue decomposition \(B = UDU^{-1}\), where \(U\) is the matrix of eigenvectors and \(D\) a diagonal matrix of eigenvalues. Since \(B\) is symmetric, \(U\) is orthogonal and \(U^{-1} = U\T\). Then \[\begin{align*} f^*(X) &= X\T UDU^{-1} X\\ &= (U\T X)\T D (U\T X) \\ &= \sum_{j=1}^p \lambda_j (U\T X)_j^2, \end{align*}\] where \(\lambda_1, \dots, \lambda_p\) are the eigenvalues of \(B\). When every eigenvalue \(\lambda_j > 0\), this function is nonnegative regardless of \(X\); when every eigenvalue \(\lambda_j < 0\), it is nonpositive regardless of \(X\); and when the eigenvalues have mixed signs, the sign depends on the particular \(X\). The stationary point of this quadratic is at \(U\T X = 0\), where it has value 0. Hence when the eigenvalues are all positive, the stationary point must be the minimum; when they are all negative, it must be the maximum; and when they have mixed signs, it is a saddle point.
Again, this translates back to our actual quadratic function \(f(X)\), because adding the intercept and linear terms does not change whether a point is a minimum, maximum, or saddle point. Examining the eigenvalues of \(B\) (and hence of the Hessian) is the exact equivalent of examining the sign of the second derivative in one dimension. This check is known in response surface design as canonical analysis.
We can use canonical analysis to determine whether we have found a maximum and where the maximizer is.
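Continuing in R, canonical analysis needs only the estimated \(\hat\beta\) and \(\hat B\) from the previous sketch; the numerical values below are hypothetical stand-ins so the block runs on its own.

```r
# Hypothetical estimates, standing in for the beta and B fitted above.
beta <- c(2.1, 2.9)
B    <- matrix(c(-2.05, 0.48,
                  0.48, -3.02), nrow = 2)

x_s <- -0.5 * solve(B, beta)      # stationary point: x_s = -(1/2) B^{-1} beta
eig <- eigen(B, symmetric = TRUE)

eig$values   # all negative here, so the stationary point is a maximum
x_s          # estimated location of that maximum (in coded units)
```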
9.4 The response-surface process
Based on the above considerations, we can write the general process for finding the value \(x^*\) that maximizes an unknown function \(f(X)\) through repeated experiments. It looks something like this:
1. Select a reasonable starting point \(x^{(0)}\) for all factors.
2. Select a first-order response surface design around this point, such as a standard first-order design. Conduct the experiment.
3. Fit a linear model to the first-order design and test its fit. If it fits reasonably well, find the direction of steepest ascent, take a step in that direction, and return to step 2. Otherwise, continue to step 4.
4. Fit the quadratic model and use canonical analysis to determine if you are near a maximum. If you are near a maximum, declare the stationary point to be your desired \(x^*\), or perhaps run one more experiment there to verify the exact value. If you are near a minimum or saddle point, figure out how to move to get away from it, and return to step 2.
Fortunately, packages exist to calculate common designs, the models to be fit are standard linear and quadratic regression models, and canonical analysis can be done easily by many packages. Let’s examine an example using one such package, the rsm package for R.
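Here is a minimal sketch of what that might look like with rsm; the design, simulated response, and variable names are illustrative, and the package's vignettes cover many more options (including functions that generate central composite designs directly).

```r
# install.packages("rsm")   # if not already installed
library(rsm)

# A rotatable central composite design for two factors, in coded units,
# built by hand as in Section 9.3.1.
alpha  <- sqrt(2)
design <- rbind(expand.grid(x1 = c(-1, 1), x2 = c(-1, 1)),
                data.frame(x1 = c(-alpha, alpha, 0, 0),
                           x2 = c(0, 0, -alpha, alpha)),
                data.frame(x1 = rep(0, 4), x2 = rep(0, 4)))

# Simulated response from a hypothetical surface, for illustration only.
set.seed(4)
design$y <- 60 + 2 * design$x1 + 3 * design$x2 - 2 * design$x1^2 -
  3 * design$x2^2 + design$x1 * design$x2 + rnorm(nrow(design), sd = 0.5)

# SO() specifies the full second-order model (linear, interaction, and
# quadratic terms).
fit <- rsm(y ~ SO(x1, x2), data = design)
summary(fit)
```

The summary of a second-order rsm fit includes the canonical analysis of Section 9.3.2, reporting the stationary point and the eigenvalues, so the final step of the process above can be read directly from the output.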
9.5 Exercises
Exercise 9.1 (Hessian of the polynomial model) Consider fitting the polynomial model \[ Y = f(X) + \epsilon = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^2 + \beta_4 X_2^2 + \beta_5 X_1 X_2 + \epsilon. \] Calculate \(\nabla^2 f(x)\). Show that it depends only on \(\beta_3\), \(\beta_4\), and \(\beta_5\).
Exercise 9.2 (Quadratic fits to standard first-order designs) In Section 9.2.2, we suggested that regardless of the number of center-point replicates, it’s not possible to fit a full quadratic model to a standard first-order design. We showed this is true when there are \(p = 2\) treatment factors, because there are 5 design points but 6 parameters to estimate.
Derive an expression for the number of parameters in a quadratic fit to \(p\) parameters, including all interactions. Compare this to the number of design points in a full factorial first-order design, and show that for any \(p > 1\), the quadratic fit is not uniquely identified.
Exercise 9.3 (Orthogonal first-order designs) In Section 9.2, we stated that a desirable property of a first-order design is that it minimizes the average variance of the regression coefficients, \[ \frac{1}{p + 1} \trace(\var(\hat \beta)). \] Consider the case where we center and scale the design matrix, so that (apart from the intercept) each column of \(\X\) has mean 0 and sample variance 1, and we assume constant error variance.
- The centering and scaling imply constraints on the sums of the columns of \(\X\). Write these constraints in terms of the individual entries \(x_{ij}\) of the design matrix.
- Rewrite the minimization in terms of \((\X\T \X)^{-1}\), so it can be connected to the design matrix \(\X\).
- Use the eigenvalue decomposition to show that because \(\X\T\X\) is symmetric and positive definite, the trace of \((\X\T\X)^{-1}\) can be written in terms of the eigenvalues of \(\X\T\X\). Use the eigenvalues to connect this to the trace of \(\X\T\X\).
- Using the constraints you found in part 1, show that \(\trace(\X\T\X)\) must be equal to \(n + p(n - 1)\).
- TODO Further steps to show that minimizing the average variance implies that the columns of \(\X\) must be orthogonal.
Exercise 9.4 (Quadratic fits without center points) Argue that in the standard first-order design (Section 9.2.2), omitting the center point makes it impossible to test for lack of fit using quadratic terms. TODO
¹ This intuition appears to be common in response surface designs, but I’m not sure it’s justified by an appeal to functions having particular smoothness and curvature properties. Perhaps experimental design theorists just expect that in nature, response surfaces will be nice, well-behaved functions, and that no physical response surface could look like the Weierstrass function.