Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65(3), 145–153. doi:10.1037/h0045186
The seminal article. Studies in psychology had, at the time, less than 50% power for “medium” effect sizes.
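For a concrete sense of what sub-50% power means, here is a quick sketch (my own illustration, not from the paper), assuming a two-sample t-test, a “medium” effect of Cohen's d = 0.5, and an illustrative 30 subjects per group:

```python
# Sketch: power of a two-sample t-test for a "medium" effect (Cohen's d = 0.5)
# with a hypothetical n of 30 per group, chosen to land near the sub-50%
# power Cohen reported.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
power = analysis.power(effect_size=0.5, nobs1=30, ratio=1.0, alpha=0.05)
print(f"Power with 30 per group: {power:.2f}")  # roughly 0.47

# Per-group sample size needed to reach 80% power for the same effect:
n_needed = analysis.solve_power(effect_size=0.5, power=0.8, alpha=0.05)
print(f"n per group for 80% power: {n_needed:.0f}")  # roughly 64
```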
Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105(2), 309–316. doi:10.1037/0033-2909.105.2.309
Follow-up on Cohen: the situation hasn’t gotten any better.
Maxwell, S. E. (2004). The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods, 9(2), 147–163. doi:10.1037/1082-989X.9.2.147
Points out that the studies Cohen surveyed contained, on average, 70 statistical tests each. Underpowered studies persist because multiple testing is so rampant that you’re bound to find something significant regardless of power, so nobody feels compelled to increase their sample sizes.
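A back-of-the-envelope sketch of that mechanism (assuming 70 independent tests at α = 0.05 and no real effects, both idealizations):

```python
# Sketch: probability of at least one spurious "significant" result among
# 70 tests at alpha = 0.05, assuming independent tests and no true effects.
alpha, n_tests = 0.05, 70
p_any_false_positive = 1 - (1 - alpha) ** n_tests
print(f"P(at least one false positive) = {p_any_false_positive:.3f}")  # about 0.97
```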
Moher, D., Dulberg, C. S., & Wells, G. A. (1994). Statistical power, sample size, and their reporting in randomized controlled trials. JAMA, 272(2), 122–124. doi:10.1001/jama.1994.03520020048013
Surveyed large RCTs and found that, of those with negative results, only a third had 80% power to detect a 50% difference between groups.
Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365–376. doi:10.1038/nrn3475
Compared individual neuroscience studies to meta-analyses and, assuming the meta-analytic effect sizes were correct, found that the median power of individual studies is 21%. Publication bias makes that assumption dubious, but in the direction of overestimating effect sizes, which means the true power is possibly even worse.
Dumas-Mallet, E., Button, K. S., Boraud, T., Gonon, F., & Munafò, M. R. (2017). Low statistical power in biomedical science: a review of three human research domains. Royal Society Open Science, 4(2), 160254. doi:10.1098/rsos.160254
Repeated the above analysis in neurology and psychiatry, with similarly poor results.
Szucs, D., & Ioannidis, J. P. A. (2017). Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature. PLOS Biology, 15(3), e2000797. doi:10.1371/journal.pbio.2000797
Analyzed cognitive neuroscience and psychology papers to show that the average study has 44% power to detect medium effect sizes, and the problem is worse for high-impact journals. Leads to an estimate that “false report probability is likely to exceed 50%”. Also calculates post-hoc power from the observed effect sizes, but see Hoenig and Heisey (2001) below.
Ioannidis, J. P. A., Stanley, T. D., & Doucouliagos, H. (2017). The power of bias in economics research. The Economic Journal, 127(605), F236–F265. doi:10.1111/ecoj.12461
Applies a method similar to Button et al. (2013) to economics, finding that in the median research area, only 10.5% of papers have adequate (80%) power. Also shows that many of these areas suffer from serious truth inflation.
Charles, P., Giraudeau, B., Dechartres, A., Baron, G., & Ravaud, P. (2009). Reporting of sample size calculation in randomised controlled trials: review. BMJ, 338, b1732. doi:10.1136/bmj.b1732
RCTs report sample size calculations in insufficient detail, and the calculations often seem to be wrong. Only 34% of trials reported all the data needed for the calculation, calculated it accurately, and used appropriate assumptions.
Vankov, I., Bowers, J., & Munafò, M. R. (2014). On the persistence of low power in psychological science. The Quarterly Journal of Experimental Psychology, 67(5), 1037–1040. doi:10.1080/17470218.2014.885986
Found that authors of psychology papers often justify their sample size by what previous studies typically used (usually not enough), rather than by actual power analysis.
Gelman, A., & Weakliem, D. (2009). Of beauty, sex, and power: statistical challenges in estimating small effects. American Scientist, 97, 310–316. http://www.stat.columbia.edu/~gelman/research/unpublished/power4r.pdf
Points out that, beyond Type I and Type II error rates, we also have to worry about overestimation of effect sizes: the significance filter combined with low power means that only overestimates of the effect will be statistically significant. Gelman calls these “Type M” errors; I call it “truth inflation.”
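A quick simulation of that mechanism (my own sketch with made-up numbers: a small true effect, d = 0.2, and 25 subjects per group):

```python
# Sketch: "truth inflation" under low power. Among the simulated studies that
# reach p < 0.05, the estimated effect is far larger than the true effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect, n_per_group, n_sims = 0.2, 25, 20_000  # hypothetical low-power setup

estimates, pvals = [], []
for _ in range(n_sims):
    a = rng.normal(true_effect, 1.0, n_per_group)
    b = rng.normal(0.0, 1.0, n_per_group)
    _, p = stats.ttest_ind(a, b)
    estimates.append(a.mean() - b.mean())
    pvals.append(p)

estimates, pvals = np.array(estimates), np.array(pvals)
significant = pvals < 0.05
print(f"Power: {significant.mean():.2f}")  # roughly 0.10
print(f"True effect: {true_effect}")
print(f"Mean estimate among significant results: {estimates[significant].mean():.2f}")
# The significant estimates average roughly three times the true effect.
```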
Hoenig, J., & Heisey, D. (2001). The abuse of power: the pervasive fallacy of power calculations for data analysis. The American Statistician, 55(1), 19–24. doi:10.1198/000313001300339897
It used to be common, after failing to obtain a statistically significant result, to compute the test’s “observed power” (power at the observed effect size) to judge whether the failure to reject the null is definitive. Hoenig and Heisey show this is fallacious: observed power is a one-to-one function of the p-value and adds no information beyond it.
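A sketch of why, for the simple case of a two-sided z-test (my own illustration, not their notation):

```python
# Sketch: "observed power" (power computed at the observed effect size) is a
# deterministic function of the p-value for a two-sided z-test, so it cannot
# add information beyond the p-value itself.
from scipy.stats import norm

def observed_power(p, alpha=0.05):
    z_obs = norm.isf(p / 2)        # |z| implied by the two-sided p-value
    z_crit = norm.isf(alpha / 2)   # 1.96 for alpha = 0.05
    return norm.sf(z_crit - z_obs) + norm.sf(z_crit + z_obs)

for p in (0.051, 0.20, 0.50, 0.90):
    print(f"p = {p:.3f} -> observed power = {observed_power(p):.2f}")
# A result with p just above 0.05 always has observed power just under 0.50,
# and larger p always means lower observed power: "nonsignificant but high
# observed power" cannot happen, so the calculation settles nothing.
```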
[To read] Machery, E. (2012). Power and negative results. Philosophy of Science, 79(5), 808–820. doi:10.1086/667877
Claims to prove Hoenig and Heisey wrong?