See also Causality, Observational studies.
Many experimental design textbooks, written for practicing scientists, don’t explain why we’re going to all of this trouble to design elaborate treatment allocations. Why, exactly, do I want to use a Latin square over some other allocation of treatments? In most cases, the purpose behind designs is control of estimation variance: by choosing treatment allocation carefully, we can obtain a treatment effect estimate that has the lowest possible variance, given our sample size constraints. This involves clever tricks like making treatment effects orthogonal to other effects by design.
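To make the orthogonality point concrete, here is a toy illustration of my own (not drawn from any of the references below): in a Latin square, every treatment appears exactly once in each row and each column, so treatment contrasts are orthogonal to the row and column effects, and adjusting for rows and columns costs nothing in estimation variance.

```python
# Toy illustration (mine): treatment contrasts in a Latin square are
# orthogonal to row and column contrasts by construction.
import numpy as np

t = 4
# Cyclic 4x4 Latin square: treatment in cell (i, j) is (i + j) mod t.
square = (np.arange(t)[:, None] + np.arange(t)[None, :]) % t

def contrasts(levels, k):
    """Sum-to-zero (deviation) coding: k levels -> k - 1 columns."""
    C = np.vstack([np.eye(k - 1), -np.ones(k - 1)])
    return C[levels]

rows = np.repeat(np.arange(t), t)   # row index of each cell
cols = np.tile(np.arange(t), t)     # column index of each cell
trts = square.ravel()               # treatment assigned to each cell

X_rows, X_cols, X_trts = contrasts(rows, t), contrasts(cols, t), contrasts(trts, t)

# Because every (row, treatment) and (column, treatment) pair occurs exactly
# once, the cross-products vanish: the design matrix blocks are orthogonal.
print(np.allclose(X_trts.T @ X_rows, 0))  # True
print(np.allclose(X_trts.T @ X_cols, 0))  # True
```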
Imbens and Rubin (2015), Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction.
Part II, “Classical Randomized Experiments”, defines the basic randomized experiments (completely randomized, blocking, etc.) in terms of potential outcomes.
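For reference, the potential-outcomes setup they use (standard notation, not specific to this book): each unit i has two potential outcomes, only one of which is observed, and under a completely randomized assignment the difference in observed means is unbiased for the average treatment effect.

```latex
% Finite-sample average treatment effect and its difference-in-means
% estimator, in Imbens & Rubin's notation (W_i is the treatment indicator).
\tau = \frac{1}{N} \sum_{i=1}^{N} \bigl( Y_i(1) - Y_i(0) \bigr),
\qquad
\hat{\tau} = \frac{1}{N_t} \sum_{i:\, W_i = 1} Y_i^{\text{obs}}
           - \frac{1}{N_c} \sum_{i:\, W_i = 0} Y_i^{\text{obs}}.
```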
[To read] Ding, P., Li, X., & Miratrix, L. W. (2017). Bridging finite and super population causal inference. Journal of Causal Inference, 5(2). doi:10.1515/jci-2016-0027
Mutz, D. C., Pemantle, R., & Pham, P. (2018). The perils of balance testing in experimental design: Messy analyses of clean data. The American Statistician. doi:10.1080/00031305.2017.1322143
Some people recommend that in a randomized experiment with assorted covariates (like patient demographics), you should test whether the covariates are roughly equal between treatment and control groups, to check that the randomization was “successful”. This doesn’t make sense: if treatment really was randomized, the balance test’s null hypothesis is true by construction, so any imbalance it detects is just chance, and it particularly shouldn’t be used as justification for including or excluding specific covariates from the model.
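A quick simulation (mine, not from the paper) makes the point: with a correctly executed randomization, roughly α of the covariates will “fail” a balance test at level α purely by chance, which tells us nothing about which covariates to adjust for.

```python
# Simulation sketch (mine): under correct randomization, balance tests reject
# at about their nominal rate, no matter how many covariates we check.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, n_covariates, n_sims = 200, 20, 500
rejections = 0
for _ in range(n_sims):
    x = rng.normal(size=(n, n_covariates))                      # baseline covariates
    w = rng.permutation(np.r_[np.ones(n // 2), np.zeros(n // 2)]).astype(bool)
    pvals = [stats.ttest_ind(x[w, j], x[~w, j]).pvalue for j in range(n_covariates)]
    rejections += np.sum(np.array(pvals) < 0.05)
print(rejections / (n_sims * n_covariates))  # close to 0.05
```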
[To read] Chaloner, K., & Verdinelli, I. (1995). Bayesian experimental design: A review. Statistical Science, 10(3), 273–304. doi:10.1214/ss/1177009939
Easterling, R. G. (2004). Teaching experimental design. The American Statistician, 58(3), 244–252. doi:10.1198/000313004x1477
Some useful examples for teaching experimental design.
Hunter, W. G. (1977). Some ideas about teaching design of experiments, with 2^5 examples of experiments conducted by students. The American Statistician, 31(1), 12–17. doi:10.1080/00031305.1977.10479185
Discussion of teaching experimental design by having students design their own experiments, including 2^5 examples of experiments designed by students and some interesting pedagogical ideas, such as having students predict the outcome of the experiment in advance so they can be surprised.
By which I mean “experiments performed on websites”, not “online” as in “online learning”.
Tang, D., Agarwal, A., O’Brien, D., & Meyer, M. (2010). Overlapping experiment infrastructure: More, better, faster experimentation. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 17–26). doi:10.1145/1835804.1835810. https://ai.google/research/pubs/pub36500
An overview of Google’s 2010-era experiment infrastructure. Splits system parameters into “layers”; parameters from separate layers can be tweaked independently without fear of conflict (e.g. some combination of parameter values producing unreadable pages). Users can then be in one experiment per layer, instead of just one experiment. Uses A/A experiments to track typical variance in metrics, for use in power analyses.
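A minimal sketch of the layering idea as I understand it (the hashing details are my own illustration, not Google’s actual implementation): each layer hashes the user with a layer-specific salt, experiments within a layer own disjoint bucket ranges, and assignments across layers are effectively independent.

```python
# Sketch of layered experiment assignment (my illustration of the idea, not
# Google's implementation).  A user falls into at most one experiment per
# layer; experiments in different layers overlap freely.
import hashlib

N_BUCKETS = 1000

def bucket(user_id: str, layer: str) -> int:
    """Hash the user into a bucket, salted by layer so layers are independent."""
    digest = hashlib.sha256(f"{layer}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % N_BUCKETS

# Hypothetical layer configuration: experiments in the same layer claim
# disjoint bucket ranges, so they are mutually exclusive.
LAYERS = {
    "ranking": {"exp_a": range(0, 100), "exp_b": range(100, 200)},
    "ui_color": {"exp_c": range(0, 500)},
}

def assignments(user_id: str) -> dict:
    """Return at most one experiment per layer for this user."""
    out = {}
    for layer, experiments in LAYERS.items():
        b = bucket(user_id, layer)
        for exp, buckets in experiments.items():
            if b in buckets:
                out[layer] = exp
                break
    return out

print(assignments("user-42"))
```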
Kohavi, R., Deng, A., Frasca, B., Walker, T., Xu, Y., & Pohlmann, N. (2013). Online controlled experiments at large scale. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1168–1176). doi:10.1145/2487575.2488217
A good discussion of how Microsoft uses controlled experiments. Sensitivity is important, because small improvements add up to millions of dollars of revenue annually. Takes the interesting position that “interactions are relatively rare and more often represent bugs than true statistical interactions”, so they do one-at-a-time experiments instead of multi-factor experiments.
Deng, A., Xu, Y., Kohavi, R., & Walker, T. (2013). Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. In Sixth ACM International Conference on Web Search and Data Mining (pp. 123–132). doi:10.1145/2433396.2433413
Proposes CUPED (Controlled-experiment Using Pre-Experiment Data), a method for reducing the variance of treatment effect estimates. They reject ANCOVA for adjusting for pre-treatment covariates because of its restrictive assumptions, then propose a method based on “control variates” which seems to be identical to ANCOVA with one covariate. I find this confusing.
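For reference, the adjustment itself: with a pre-experiment covariate X (say, the same metric measured before the experiment), replace the metric Y with Y − θ·(X − mean(X)), where θ = cov(Y, X)/var(X), and compare adjusted means between arms. Since X is independent of the randomized treatment, the estimate stays unbiased, and its variance drops to (1 − ρ²) times the unadjusted variance, where ρ is the correlation between Y and X, which is exactly what ANCOVA with one covariate gives. A minimal sketch:

```python
# CUPED adjustment: remove the part of the metric that a pre-experiment
# covariate predicts, then compare means between arms as usual.
import numpy as np

def cuped_adjust(y, x):
    """Return y - theta * (x - mean(x)) with theta = cov(y, x) / var(x)."""
    theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - np.mean(x))

# Toy data: y is the in-experiment metric, x the same metric pre-experiment.
rng = np.random.default_rng(1)
x = rng.normal(10, 2, size=10_000)
y = 0.8 * x + rng.normal(0, 1, size=10_000)
y_adj = cuped_adjust(y, x)
print(np.var(y), np.var(y_adj))  # the adjusted metric has much lower variance
```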
Poyarkov, A., Drutsa, A., Khalyavin, A., Gusev, G., & Serdyukov, P. (2016). Boosted decision tree regression adjustment for variance reduction in online controlled experiments. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 235–244). doi:10.1145/2939672.2939688
A cross of fancy machine learning and classic experimental design, following up on CUPED. (Fortunately, they notice CUPED is just ANCOVA.) If ANCOVA is problematic because it assumes covariates have linear effects (though of course we could do nonparametric regression), why not predict the effects of covariates with decision trees? This would be unappealing in small-data experiments, where there’s hardly enough data to train a fancy machine learning algorithm, but in online experiments with millions of users a fancy machine learning method may be entirely suitable for estimating covariate effects, and it reduces treatment effect estimation variance for the same reasons ANCOVA does.
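A sketch of the general idea (mine, simplified; not the paper’s exact procedure): fit a flexible regressor on pre-experiment covariates, then estimate the treatment effect on the residuals.

```python
# Sketch (simplified, not the paper's exact procedure): predict the metric
# from pre-experiment covariates with gradient-boosted trees, then run the
# usual difference-in-means analysis on the residuals.  Because the covariates
# are independent of the randomized treatment, the estimate stays essentially
# unbiased while its variance shrinks, just as with ANCOVA/CUPED.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
n = 20_000
X = rng.normal(size=(n, 5))              # pre-experiment covariates
w = rng.integers(0, 2, size=n)           # randomized treatment indicator
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * w + rng.normal(size=n)

# In practice the predictor should be trained on pre-experiment or held-out
# data (or cross-fitted) so it cannot leak the treatment effect.
model = GradientBoostingRegressor().fit(X, y)
resid = y - model.predict(X)

naive = y[w == 1].mean() - y[w == 0].mean()
adjusted = resid[w == 1].mean() - resid[w == 0].mean()
print(naive, adjusted)  # similar point estimates; the adjusted one is less noisy
```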
Hohnhold, H., O’Brien, D., & Tang, D. (2015). Focusing on the long-term: It’s good for users and business. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1849–1858). doi:10.1145/2783258.2788583
Describes the challenges facing experimental design in online experiments (ad experiments at Google). Sample size is not a problem. But there are per-user covariates, like ad blindness and willingness to use the product, which affect outcomes (ad revenue) but are also affected by the treatment (different ad placement parameters). A new ad placement scheme might improve revenue in the short term but train users to ignore ads in the long term.
This is counter to the usual problem in experimental design – that covariates might affect the choice of treatment, requiring us to randomize treatments to break the causal link. Here, the treatment affects the covariates, and the authors propose some clever ways to measure this effect, such as keeping some users in the treatment group for a long period while comparing them against users entered into the treatment group for only a day at a time. These led to a change that “decreased the search ad load on Google’s mobile traffic by 50%, resulting in dramatic gains in user experience metrics” which “would be so great that the long-term revenue change would be a net positive.”
[To read] Larsen, N., Stallrich, J., Sengupta, S., Deng, A., Kohavi, R., & Stevens, N. T. (2023). Statistical challenges in online controlled experiments: A review of A/B testing methodology. The American Statistician. doi:10.1080/00031305.2023.2257237
Hill, D. N., Nassif, H., Liu, Y., Iyer, A., & Vishwanathan, S. (2017). An efficient bandit algorithm for realtime multivariate optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1813–1821). ACM Press. doi:10.1145/3097983.3098184
An adaptive method for multivariate (multiple simultaneous factors) experiments. Rather than running a factorial experiment, picking the winner, and putting the winner into production, the adaptive design sends users to factor combinations which either have high expected reward (they make money) or have high uncertainty. Avoids sending users to treatments which are sure losers, without waiting for the experiment to finish.
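A heavily simplified sketch of the adaptive idea (not the paper’s algorithm, which runs Thompson sampling over a Bayesian linear model so information is shared across factor combinations): independent Beta-Bernoulli Thompson sampling over the combinations. Clearly losing combinations stop receiving traffic early; promising or uncertain ones keep getting explored.

```python
# Simplified sketch (not the paper's algorithm): Thompson sampling over the
# 2^3 combinations of three binary factors, with an independent Beta-Bernoulli
# model per combination.  Sure losers quickly stop receiving users.
import itertools
import numpy as np

rng = np.random.default_rng(3)
combos = list(itertools.product([0, 1], repeat=3))
true_rates = {c: 0.02 + 0.01 * sum(c) for c in combos}   # hypothetical reward rates

alpha = {c: 1.0 for c in combos}   # Beta(1, 1) prior on each combination's rate
beta = {c: 1.0 for c in combos}

for _ in range(100_000):
    draws = {c: rng.beta(alpha[c], beta[c]) for c in combos}  # sample a rate per combo
    chosen = max(draws, key=draws.get)                        # show the sampled best
    reward = rng.random() < true_rates[chosen]
    alpha[chosen] += reward
    beta[chosen] += 1 - reward

traffic = {c: int(alpha[c] + beta[c] - 2) for c in combos}
print(max(combos, key=lambda c: alpha[c] / (alpha[c] + beta[c])), traffic)
```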