See also Paul Meehl and psychology, on the nature of hypothesis testing and why it often provides limited evidence for or against theories, and Philosophy of science.
Bolles, R. C. (1962). The difference between statistical hypotheses and scientific hypotheses. Psychological Reports, 11(7), 639–645. doi:10.2466/pr0.11.7.639-645
Makes a crucial point:
The problem here, basically, is that statistical rejection of the null hypothesis tells the scientist only what he was already quite sure of – the animals are not behaving randomly. The fact that the null hypothesis can be rejected with a p of .99 does not give E an assurance of .99 that his particular hypothesis is true, but only that some alternative to the null hypothesis is true.
You must exclude alternative scientific explanations before you can conclude that a statistical result supports your scientific hypothesis.
There’s a lot of discussion about the statistical hypothesis being conditioned on statistical assumptions, and hence not being relevant to the scientific hypothesis if those assumptions are false. This is true, but I think it can be phrased in terms of Meehl’s recognition that an experiment depends not just on the scientific theory but also on a set of additional assumptions, experimental conditions, and so on. Statistical assumptions are included there, and rejecting a statistical hypothesis means there’s a problem with either the scientific theory, the statistical assumptions, the assumptions that went into the experiment, the auxiliary theories, or something else entirely – and we don’t know which.
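A toy simulation (my own, with made-up numbers and a made-up odor/lighting story, not from Bolles) makes the point concrete: data generated by a nuisance cause reject the null just as decisively as data generated by the favored theory, so the rejection alone cannot tell the two apart.

```python
# Rejecting "the animals behave randomly" does not confirm any particular
# scientific hypothesis, because many alternatives produce the same rejection.
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)

n_trials = 200
# Suppose our theory says a learned odor cue makes rats choose the left arm
# 70% of the time -- but the data below are generated by a nuisance cause
# (say, an uneven light gradient) that also produces a 70% left preference.
p_left_from_nuisance = 0.70
left_choices = rng.binomial(n_trials, p_left_from_nuisance)

# The statistical test only asks: are the choices 50/50?
result = binomtest(left_choices, n_trials, p=0.5)
print(f"left choices: {left_choices}/{n_trials}, p-value: {result.pvalue:.2e}")
# The tiny p-value rejects "random behaviour", but it cannot distinguish the
# odor-cue theory from the lighting artifact: both are alternatives to the
# null, and the test is silent about which one is operating.
```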
Box, G. E. P. (1976). Science and statistics. Journal of the American Statistical Association, 71(356), 791–799. doi:10.2307/2286841
Presents statistics and science as an iterative process:
Matters of fact can lead to a tentative theory. Deductions from this tentative theory may be found to be discrepant with certain known or specially acquired facts. These discrepancies can then induce a modified, or in some cases a different, theory. Deductions made from the modified theory now may or may not be in conflict with fact, and so on.
The paper was the R. A. Fisher Memorial Lecture, so it spends some time discussing Fisher’s contributions to experimental design, emphasizing his mathematical ingenuity and close contact with the actual experiments. It focuses on building a good model, experimenting, finding deviations from the model’s predictions – via residual analysis – and then finding the most promising route to fix it.
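A small sketch of that loop (my own toy construction, not from the paper): fit a tentative straight-line model, look for systematic structure in the residuals, and let that structure suggest the modification.

```python
# Fit a tentative model, inspect residuals for systematic structure, and
# let that structure suggest how to modify the model.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 99)
y = 2.0 + 1.5 * x + 0.3 * x**2 + rng.normal(scale=2.0, size=x.size)  # truth is curved

# Tentative theory: a straight line.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Residual analysis: average residual over the low, middle, and high thirds
# of x.  A U-shaped pattern is the "discrepancy with fact" that induces a
# modified (here, quadratic) theory, which is then refit and checked again.
for label, idx in zip(("low", "mid", "high"), np.array_split(np.arange(x.size), 3)):
    print(f"{label:>4} x: mean residual = {residuals[idx].mean():+.2f}")

quadratic = np.polyfit(x, y, deg=2)
print("modified model coefficients (x^2, x, 1):", np.round(quadratic, 2))
```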
Kass, R. E. (2011). Statistical inference: The big picture. Statistical Science, 26(1), 1–9. doi:10.1214/10-sts337
Propounds “statistical pragmatism”, which doesn’t take quite so seriously the philosophical distinctions between frequentist and Bayesian inference, instead remembering that model assumptions are statements about the model we intend to use to simplify reality, not reality itself. The choice of model to represent reality depends on pragmatic concerns, not philosophical statements about the nature of probability, because the philosophical statements are all about the model and its parameters, not reality.
Gelman, A., & Shalizi, C. R. (2012). Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66(1), 8–38. doi:10.1111/j.2044-8317.2011.02037.x
Argues against the common philosophy that a Bayesian need only specify a model and prior, then continually update the posterior with new information, eventually converging on the Truth. Instead,
In our hypothetico-deductive view of data analysis, we build a statistical model out of available parts and drive it as far as it can take us, and then a little farther. When the model breaks down, we dissect it and figure out what went wrong. For Bayesian models, the most useful way of figuring out how the model breaks down is through posterior predictive checks, creating simulations of the data and comparing them to the actual data.
The problem with the canonical Bayes-is-all-of-rationality view is that it assumes the truth of the model, or at least that the Truth is contained in the support of the prior; if we believe that our models are usually wrong or do not contain all relevant variables, our posterior may converge to something close to the Truth in likelihood, but “may or may not be close to the truth in the sense of giving accurate values for parameters of scientific interest”. Explores connections to Popperian falsificationism and gives practical examples of Bayesian model checking.
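A bare-bones posterior predictive check, as a sketch of the idea (the Poisson model, conjugate prior, and dispersion statistic below are my own toy choices, not from the paper):

```python
# Simulate replicated data from the posterior predictive distribution and
# compare a test statistic computed on the replications with the same
# statistic computed on the observed data.
import numpy as np

rng = np.random.default_rng(2)

# "Observed" counts that are overdispersed relative to a Poisson model.
observed = rng.negative_binomial(n=5, p=0.5, size=100)

# Model under check: y_i ~ Poisson(lambda), with a conjugate Gamma(a0, b0)
# prior, so the posterior is Gamma(a0 + sum(y), b0 + n).
a0, b0 = 1.0, 1.0
a_post, b_post = a0 + observed.sum(), b0 + observed.size

def dispersion(y):
    return y.var() / y.mean()   # variance/mean ratio; 1 for a Poisson

replicated = []
for _ in range(1000):
    lam = rng.gamma(shape=a_post, scale=1.0 / b_post)
    y_rep = rng.poisson(lam, size=observed.size)
    replicated.append(dispersion(y_rep))

p_value = np.mean(np.array(replicated) >= dispersion(observed))
print(f"observed dispersion: {dispersion(observed):.2f}")
print(f"posterior predictive p-value: {p_value:.3f}")
# A p-value near 0 (or 1) says the replications rarely look like the data:
# the Poisson model breaks down, and the check shows in what way -- here,
# it cannot reproduce the overdispersion.
```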
Shmueli, G. (2010). To explain or to predict? Statistical Science, 25(3), 289–310. doi:10.1214/10-sts330
A careful exploration of the distinction between explanatory and predictive modeling, and the implications of each for statistical methodology. Not ground-breaking, but the basic distinction is important: an explanatory model of the relationship between two variables wants the functional form of the relationship to be as close to the Truth as possible, whereas a predictive model only cares about predicted values being close to responses, regardless of the functional form of the relationship. This has implications for your choice of methods, since the model that predicts best need not be the one whose form best matches the truth.
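A rough numerical illustration of the distinction (my own toy example, not Shmueli’s): over the observed range, a model with the wrong functional form can predict about as well as the true one, so the two goals really do pull method choice in different directions.

```python
# A model with the wrong functional form can predict nearly as well as the
# true form, even though its parameters say little about the true relationship.
import numpy as np

rng = np.random.default_rng(3)

def true_mean(x):
    return 3.0 * np.log1p(x)          # the "true" explanatory form

x_train = rng.uniform(1, 10, size=200)
y_train = true_mean(x_train) + rng.normal(scale=1.0, size=x_train.size)
x_test = rng.uniform(1, 10, size=200)
y_test = true_mean(x_test) + rng.normal(scale=1.0, size=x_test.size)

# Correct functional form: linear in log(1 + x).
b_log = np.polyfit(np.log1p(x_train), y_train, deg=1)
# Wrong functional form: linear in x.
b_lin = np.polyfit(x_train, y_train, deg=1)

def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat) ** 2))

print("test RMSE, true form :", round(rmse(y_test, np.polyval(b_log, np.log1p(x_test))), 3))
print("test RMSE, wrong form:", round(rmse(y_test, np.polyval(b_lin, x_test)), 3))
# Predictively the two are nearly tied, so a purely predictive analysis has
# little reason to prefer the true form; an explanatory analysis, which cares
# about the form and the meaning of the coefficients, very much does.
```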