See also Causality, Observational studies.
Many experimental design textbooks, written for practicing scientists, don’t explain why we’re going to all of this trouble to design elaborate treatment allocations. Why, exactly, do I want to use a Latin square over some other allocation of treatments? In most cases, the purpose of these designs is to control estimation variance: by choosing the treatment allocation carefully, we can obtain a treatment effect estimate with the lowest possible variance, given our sample size constraints. This involves clever tricks like making treatment effects orthogonal to other effects by design.
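To make that concrete, here is the standard linear-model form of the argument (my own gloss, not drawn from any particular book). Writing the responses as $y = X\beta + \varepsilon$,

$$\hat{\beta} = (X^\top X)^{-1} X^\top y, \qquad \operatorname{Var}(\hat{\beta}) = \sigma^2 (X^\top X)^{-1},$$

so the allocation affects the variance only through $X^\top X$. The variance of a treatment coefficient is inflated by the factor $1/(1 - R_j^2)$, where $R_j^2$ measures how well the other columns of $X$ predict the treatment column; making the treatment columns orthogonal to the block and nuisance columns drives $R_j^2$ to zero, the smallest variance achievable for that sample size.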
Imbens and Rubin (2015), Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction.
Part II, “Classical Randomized Experiments”, defines the basic randomized experiments (completely randomized, blocking, etc.) in terms of potential outcomes.
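The basic notation, roughly as they set it up: each unit $i$ has potential outcomes $Y_i(1)$ under treatment and $Y_i(0)$ under control, only one of which is ever observed, and the finite-population average treatment effect is

$$\tau = \frac{1}{N} \sum_{i=1}^{N} \bigl( Y_i(1) - Y_i(0) \bigr),$$

for which the simple difference between observed treatment and control means is unbiased under complete randomization.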
[To read] Ding, P., Li, X., & Miratrix, L. W. (2017). Bridging finite and super population causal inference. Journal of Causal Inference, 5(2). doi:10.1515/jci-2016-0027
Mutz, D. C., Pemantle, R., & Pham, P. (2018). The perils of balance testing in experimental design: Messy analyses of clean data. The American Statistician. doi:10.1080/00031305.2017.1322143
Some people recommend that in a randomized experiment with assorted covariates (like patient demographics), you should test that the covariates are roughly equal between treatment and control groups, indicating the randomization was “successful”. This doesn’t make sense, particularly if we use it as justification for including or excluding specific covariates from the model.
[To read] Chaloner, K., & Verdinelli, I. (1995). Bayesian experimental design: A review. Statistical Science, 10(3), 273–304. doi:10.1214/ss/1177009939
Most experimental design books focus heavily on sums of squares and tedious algebraic manipulations to show the various properties of estimators, contrasts, and so on. But designs can be represented in matrix form and properties of the estimators derived via linear algebra—though I have never found a textbook that does so. Some papers that give pieces of the results:
Tocher, K. D. (1952). The design and analysis of block experiments. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 14(1), 45–91. doi:10.1111/j.2517-6161.1952.tb00101.x
For simple block experiments, results about the variance of estimates can be derived in terms of an incidence matrix \mathbf{n}, with rows corresponding to treatments and columns to blocks, whose entries give the number of units in that treatment and block. Gives matrix forms of variance and estimation results. This can be extended to various kinds of block designs, but I’m not sure it can be extended to, say, factorials; the paper suggests an incidence tensor and admits “pressure of space forces postponement of consideration of these designs to another paper.”
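As a toy illustration of the incidence-matrix bookkeeping (my own example and code, not Tocher’s notation or development):

```python
import numpy as np

# Unit-level assignments for a small block experiment:
# each row is (treatment index, block index) for one experimental unit.
assignments = np.array([
    (0, 0), (1, 0), (2, 0),   # block 0 receives every treatment once
    (0, 1), (1, 1), (2, 1),   # block 1 likewise (a complete block)
    (0, 2), (1, 2),           # block 2 is too small for treatment 2
])

n_treatments = assignments[:, 0].max() + 1
n_blocks = assignments[:, 1].max() + 1

# Incidence matrix: rows are treatments, columns are blocks, and each
# entry counts the units given that treatment in that block.
incidence = np.zeros((n_treatments, n_blocks), dtype=int)
for treatment, block in assignments:
    incidence[treatment, block] += 1

print(incidence)
# Row sums are treatment replications; column sums are block sizes.
print(incidence.sum(axis=1), incidence.sum(axis=0))
```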
Nelder, J. A. (1965). The analysis of randomized experiments with orthogonal block structure. I. Block structure and the null analysis of variance. Proceedings of the Royal Society A, 283(1393), 147–162. doi:10.1098/rspa.1965.0012
Nelder, J. A. (1965). The analysis of randomized experiments with orthogonal block structure. II. Treatment structure and the general analysis of variance. Proceedings of the Royal Society A, 283(1393), 163–178. doi:10.1098/rspa.1965.0013
Derivation of some ANOVA results relevant to blocked experiments in matrix notation, not sums of squares.
Tjur, T. (1984). Analysis of variance models in orthogonal designs. International Statistical Review, 52(1), 33–65. doi:10.2307/1403242
Detailed mathematical treatment for general orthogonal designs.
Mead, R. (1990). The non-orthogonal design of experiments. Journal of the Royal Statistical Society. Series A (Statistics in Society), 153(2), 151–178. doi:10.2307/2982800
The focus on manual manipulation of sums of squares means most classic experimental designs rely on orthogonality. Orthogonal designs ensure effects and contrasts can be computed by hand, without needing to manipulate and invert the design matrix. Mead argues that with modern (in 1990 terms!) computers, this restriction is unnecessary, and there are many good non-orthogonal designs that can be taught. Gives good examples of design considerations, worked through for realistic experiments, that lead to non-orthogonal designs with unconventional blocking structures.
Bailey (2008), Design of Comparative Experiments, Cambridge University Press.
Book-length summary of the mathematical approach to design, focusing on linear algebra and combinatorial structures of design – but also focusing on orthogonal designs, regardless of Mead’s objections.
Cheng (2014), Theory of Factorial Design, CRC Press.
Similar to Bailey’s book, but perhaps expanded.
Easterling, R. G. (2004). Teaching experimental design. The American Statistician, 58(3), 244–252. doi:10.1198/000313004x1477
Some useful examples for teaching experimental design.
Hunter, W. G. (1977). Some ideas about teaching design of experiments, with 2^5 examples of experiments conducted by students. The American Statistician, 31(1), 12–17. doi:10.1080/00031305.1977.10479185
Discussion of teaching experimental design by having students design their own experiments, including 2^5 examples of experiments designed by students and some interesting pedagogical ideas, such as having students predict the outcome of the experiment in advance so they can be surprised.
Pollock, K. H., Ross-Parker, H. M., & Mead, R. (1979). A sequence of games useful in teaching experimental design to agriculture students. The American Statistician, 33(2), 70–76. doi:10.1080/00031305.1979.10482663
Some pencil-and-paper games for experimental design, in which students design experiments and receive simulated results. Could be adapted to a computer simulation instead. The settings are designed so that standard complete balanced block designs don’t necessarily work, to force students to be creative.
By which I mean “experiments performed on websites”, not “online” as in “online learning”.
Tang, D., Agarwal, A., O’Brien, D., & Meyer, M. (2010). Overlapping experiment infrastructure: More, better, faster experimentation. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 17–26). doi:10.1145/1835804.1835810. https://ai.google/research/pubs/pub36500
An overview of Google’s 2010-era experiment infrastructure. Splits system parameters into “layers”; parameters from separate layers can be tweaked independently without fear of conflict (e.g. some combination of parameter values producing unreadable pages). Users can then be in one experiment per layer, instead of just one experiment. Uses A/A experiments to track typical variance in metrics, for use in power analyses.
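Setting the paper’s specifics aside, the layered-assignment idea can be sketched with per-layer hashing (the hashing scheme below is my illustration, not Google’s implementation):

```python
import hashlib

def bucket(user_id: str, layer: str, n_buckets: int = 1000) -> int:
    """Deterministically map a user to a bucket within one layer.

    Hashing the user id together with a layer-specific salt makes the
    assignments independent across layers, so a user can be in one
    experiment per layer without the experiments interfering.
    """
    digest = hashlib.sha256(f"{layer}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_buckets

# The UI layer and the ranking layer divert the same user independently;
# each experiment owns a disjoint range of buckets within its own layer.
user = "user-12345"
print(bucket(user, layer="ui"), bucket(user, layer="ranking"))
```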
Kohavi, R., Deng, A., Frasca, B., Walker, T., Xu, Y., & Pohlmann, N. (2013). Online controlled experiments at large scale. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1168–1176). doi:10.1145/2487575.2488217
A good discussion of how Microsoft uses controlled experiments. Sensitivity is important, because small improvements add up to millions of dollars of revenue annually. Takes the interesting position that “interactions are relatively rare and more often represent bugs than true statistical interactions”, so they do one-at-a-time experiments instead of multi-factor experiments.
Deng, A., Xu, Y., Kohavi, R., & Walker, T. (2013). Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. In Sixth ACM International Conference on Web Search and Data Mining (pp. 123–132). doi:10.1145/2433396.2433413
Proposes the CUPED (Controlled-experiment Using Pre-Experiment Data) method for reducing variance in treatment effect estimation. Rejecting ANCOVA as a way to account for pre-experiment covariates because of its restrictive assumptions, they propose a method based on “control variates”, which seems to be identical to ANCOVA with one covariate. I find this confusing.
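As I read it, the adjustment takes a pre-experiment covariate $X$ (typically the same metric measured before the experiment) and replaces the metric $Y$ with $Y - \theta(X - \bar{X})$, where $\theta = \operatorname{cov}(X, Y) / \operatorname{var}(X)$ is exactly the pooled regression slope. A minimal sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Simulated pre-experiment metric, random assignment, and outcome.
pre = rng.normal(10, 2, size=n)
treated = rng.integers(0, 2, size=n)
outcome = 0.8 * pre + 0.1 * treated + rng.normal(0, 1, size=n)

# CUPED-style adjustment: theta is the pooled slope of the outcome on
# the pre-experiment covariate.
theta = np.cov(pre, outcome, bias=True)[0, 1] / np.var(pre)
adjusted = outcome - theta * (pre - pre.mean())

def diff_in_means(y):
    return y[treated == 1].mean() - y[treated == 0].mean()

print("unadjusted estimate:", diff_in_means(outcome))
print("adjusted estimate:  ", diff_in_means(adjusted))
print("variance reduction: ", 1 - adjusted.var() / outcome.var())
```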
Poyarkov, A., Drutsa, A., Khalyavin, A., Gusev, G., & Serdyukov, P. (2016). Boosted decision tree regression adjustment for variance reduction in online controlled experiments. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 235–244). doi:10.1145/2939672.2939688
A cross of fancy machine learning and classic experimental design, following up on CUPED. (Fortunately they notice CUPED is just ANCOVA.) If ANCOVA is problematic because it assumes linear effects of covariates (though of course we could do nonparametric regression), why not predict the effects of covariates with decision trees? This would be unappealing in small-data experiments, where there’s hardly enough data to train a fancy machine learning algorithm, but in online experiments with millions of users, a fancy machine learning method may be entirely suitable for estimating covariate effects, and it reduces the variance of the treatment effect estimate for the same reasons ANCOVA does.
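A rough sketch of the general idea (regression adjustment with a flexible learner, not the authors’ exact algorithm): predict the outcome from the covariates alone, then estimate the treatment effect on the residuals.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
n = 20_000

# Pre-experiment covariates with a nonlinear effect on the outcome.
covariates = rng.normal(size=(n, 3))
treated = rng.integers(0, 2, size=n)
outcome = (np.sin(covariates[:, 0]) + covariates[:, 1] ** 2
           + 0.05 * treated + rng.normal(0, 1, size=n))

# Predict the outcome from the covariates only (treatment excluded).
# In practice the model would be trained on held-out or pre-experiment
# data so it cannot absorb any of the treatment effect.
model = GradientBoostingRegressor().fit(covariates, outcome)
residuals = outcome - model.predict(covariates)

# Difference the residuals between arms to estimate the effect.
effect = residuals[treated == 1].mean() - residuals[treated == 0].mean()
print("adjusted treatment effect estimate:", effect)
```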
Hohnhold, H., O’Brien, D., & Tang, D. (2015). Focusing on the long-term: It’s good for users and business. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1849–1858). doi:10.1145/2783258.2788583
Describes the challenges facing experimental design in online experiments (ad experiments at Google). Sample size is not a problem. But there are per-user covariates, like ad blindness and willingness to use the product, which affect outcomes (ad revenue) but are also affected by the treatment (different ad placement parameters). A new ad placement scheme might improve revenue in the short term but train users to ignore ads in the long term.
This is counter to the usual problem in experimental design – that covariates might affect the choice of treatment, requiring us to randomize treatments to break the causal link. Here, the treatment affects the covariates, and the authors propose some clever ways to measure this effect, such as keeping some users in the treatment group for a long period while comparing them against users entered into the treatment group for only a day at a time. These led to a change that “decreased the search ad load on Google’s mobile traffic by 50%, resulting in dramatic gains in user experience metrics” which “would be so great that the long-term revenue change would be a net positive.”
[To read] Larsen, N., Stallrich, J., Sengupta, S., Deng, A., Kohavi, R., & Stevens, N. T. (2023). Statistical challenges in online controlled experiments: A review of A/B testing methodology. The American Statistician. doi:10.1080/00031305.2023.2257237
Hill, D. N., Nassif, H., Liu, Y., Iyer, A., & Vishwanathan, S. (2017). An efficient bandit algorithm for realtime multivariate optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1813–1821). ACM Press. doi:10.1145/3097983.3098184
An adaptive method for multivariate (multiple simultaneous factors) experiments. Rather than running a factorial experiment, picking the winner, and putting the winner into production, the adaptive design sends users to factor combinations which either have high expected reward (they make money) or have high uncertainty. Avoids sending users to treatments which are sure losers, without waiting for the experiment to finish.
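The paper’s model is more elaborate (it has to cope with many interacting factors), but the underlying bandit idea can be sketched with Thompson sampling over the factor combinations, which naturally sends traffic to combinations that are either promising or still uncertain:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup: 2 x 3 = 6 factor combinations ("layouts"), each with an
# unknown conversion rate; these rates are made up for illustration.
true_rates = np.array([0.030, 0.032, 0.028, 0.035, 0.031, 0.029])
n_arms = len(true_rates)

# Beta(1, 1) prior on each combination's conversion rate.
successes = np.ones(n_arms)
failures = np.ones(n_arms)

for _ in range(50_000):  # one iteration per arriving user
    # Thompson sampling: draw a rate from each posterior and show the
    # user the combination with the highest draw. Arms that are good
    # or uncertain keep getting traffic; sure losers are starved early.
    draws = rng.beta(successes, failures)
    arm = int(np.argmax(draws))
    converted = rng.random() < true_rates[arm]
    successes[arm] += converted
    failures[arm] += 1 - converted

print("traffic per combination:", (successes + failures - 2).astype(int))
print("posterior mean rates:", successes / (successes + failures))
```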