See also Philosophy of statistics and Philosophy of science for a broader discussion beyond Meehl.
Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34(2), 103–115. http://www.jstor.org/stable/10.2307/186099
A very simple argument: most psychological “theories” predict only the direction of an effect, and since we’re complicated enough that every intervention causes something to happen, the probability that an experiment detects an effect in the predicted direction approaches 0.5 as sample size increases, even if the theory has no merit. Physical theories, by contrast, face ever stricter tests as measurement precision increases, because they make specific point predictions.
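A minimal simulation of this argument (my sketch, not from the paper; the crud-level correlation of |r| = 0.1, the sample sizes, and alpha = 0.05 are all assumed numbers for illustration): every pair of variables correlates a little, the “theory” guesses a direction at random, and the rate of significant results in the predicted direction climbs toward 0.5 as n grows.

```python
# Sketch of the directional-prediction argument, under assumed (hypothetical)
# numbers: every variable pair has a small "crud" correlation of |r| = 0.1,
# the theory's predicted direction is right only by chance, alpha = 0.05.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

def confirmation_rate(n, trials=4000, crud_r=0.1, alpha=0.05):
    hits = 0
    for _ in range(trials):
        true_sign = rng.choice([-1, 1])        # which way the crud correlation points
        predicted_sign = rng.choice([-1, 1])   # the theory's directional guess
        cov = [[1.0, true_sign * crud_r], [true_sign * crud_r, 1.0]]
        x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
        r, p = pearsonr(x, y)
        hits += (p < alpha) and (np.sign(r) == predicted_sign)
    return hits / trials

for n in (20, 100, 500, 2000):
    print(n, confirmation_rate(n))
# As n grows, the test detects the (theoretically irrelevant) crud correlation
# almost surely, so the rate climbs toward 0.5: the coin-flip chance that the
# predicted direction happened to match.
```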
There are several other important points contained in this paper. While Meehl’s writing anticipates many later developments in the field (like Gelman’s focus on “type S” errors, or treating the magnitude of effects as more interesting than their mere existence), none of the later authors had as much fun as Meehl:
Meanwhile our eager-beaver researcher, undismayed by logic-of-science considerations and relying blissfully on the “exactitude” of modern statistical hypothesis-testing, has produced a long publication list and been promoted to a full professorship. In terms of his contribution to the enduring body of psychological knowledge, he has done hardly anything. His true position is that of a potent-but-sterile intellectual rake, who leaves in his merry path a long train of ravished maidens but no viable scientific offspring.
While the hints of sexism are best left in the 1960s, the stylistic flair would be welcome in more modern papers.
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46(4), 806–834. doi:10.1037/0022-006X.46.4.806
Starts with 20 reasons why psychology is hard, then turns to the philosophy of science and the nature of significance testing. Points out that to test a theoretical conjecture T, we also need auxiliary hypotheses A and the conditions of the experiment C, so the prediction actually takes the form T \wedge A \wedge C \Longrightarrow O for some observation O. When O fails to obtain, modus tollens gives only \neg T \vee \neg A \vee \neg C, which proves very little about T itself.
This goes back to the question of statistical hypothesis vs. substantive theory: the statistical hypothesis requires introducing A and C, and may only involve small parts of T, and yet we act as though significance immediately proves T.
Meehl, P. E. (1990). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry, 1(2), 108–141. doi:10.1207/s15327965pli0102_1
Some of this goes over my head. The article reviews what Meehl considers the current state of metatheory, including the points made above, along with further observations.
Overall, however, I get the impression I need to read more Lakatos to understand the rest of this article.
Meehl, P. E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66(1), 195–244. doi:10.2466/pr0.1990.66.1.195
Focuses on NHST-driven observational studies in “soft” parts of psychology. (He also lumps in experimental studies where the results depend on interactions between experimentally controlled variables and observed variables, such as age.) His thesis:
Null hypothesis testing of correlational predictions from weak substantive theories in soft psychology is subject to the influence of ten obfuscating factors whose effects are usually (1) sizeable, (2) opposed, (3) variable, and (4) unknown. The net epistemic effect of these ten obfuscating influences is that the usual research literature review is well-nigh uninterpretable.
He adds that
If the reader is impelled to object at this point “Well, but for heaven’s sake, you are practically saying that the whole tradition of testing substantive theories in soft psychology by null hypothesis refutation is a mistake, despite R. A. Fisher and Co. in agronomy,” that complaint does not disturb me because that is exactly what I am arguing.
Some of the points echo “Theoretical risks and tabular asterisks”: auxiliary hypotheses and ceteris paribus assumptions are often as problematic as the theory under test. But Meehl also lists experimenter error (e.g. from unblinded experiments), inadequate statistical power, the “crud factor” (the fact that, in practice, all null hypotheses are false), publication bias in various forms, and poor validation of tests and instruments.
There’s an interesting example: suppose we pick theories at random and “test” them by picking a pair of variables at random and testing their correlation. (These variables need have no relationship to the theory at all.) Given typical statistical power, and an assumed crud factor, the theory will be “confirmed” a third of the time or more. So if our scientific procedures work on gibberish just as well as they do on “real” theories and data, what does that say about them?
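A back-of-the-envelope version of that arithmetic (my numbers, not Meehl’s: an assumed crud-level correlation of r = 0.25 and n = 80 subjects): half the time the random directional guess matches the sign of the crud correlation and gets “confirmed” with probability equal to the test’s power; the other half it needs a Type I error in the guessed direction.

```python
# Rough arithmetic behind the "confirmed a third of the time" claim, under
# assumed (illustrative) numbers: crud correlation r = 0.25, n = 80, alpha = 0.05.
# P(confirm) ~= 0.5 * power + 0.5 * (alpha / 2)
import numpy as np
from scipy.stats import norm

crud_r, n, alpha = 0.25, 80, 0.05
z_effect = np.arctanh(crud_r) * np.sqrt(n - 3)   # Fisher z approximation
z_crit = norm.ppf(1 - alpha / 2)
power = norm.sf(z_crit - z_effect) + norm.cdf(-z_crit - z_effect)
p_confirm = 0.5 * power + 0.5 * (alpha / 2)
print(f"power ~ {power:.2f}, P(random theory 'confirmed') ~ {p_confirm:.2f}")
# With these numbers: power is about 0.6 and the confirmation probability about
# 0.3, i.e. roughly a third, regardless of whether the theory says anything true.
```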
Meehl also returns to the issue of statistical hypotheses not directly corresponding to scientific hypotheses, pointing out that in, say, agronomy, refutation of the null is proof of exactly what we want to know: this fertilizer works better than the other one. But, with the crud factor and numerous auxiliary hypotheses and assumptions required in psychology, refutation of a statistical hypothesis means nearly nothing about the scientific hypothesis, and a few graduate students could come up with a dozen alternate explanations if provided free breakfast and a whiteboard.
I’d call this paper “the essential Meehl”.
Serlin, R. C., & Lapsley, D. K. (1985). Rationality in psychological research: The good-enough principle. American Psychologist, 40(1), 73–83. doi:10.1037/0003-066X.40.1.73
A reply to Meehl’s 1967 and 1978 articles (above). It tries to resolve Meehl’s methodological paradox by abandoning falsificationism as unworkable and moving instead to Lakatos’s idea of research programs, which defend their core theory with a “protective belt” of auxiliary hypotheses that take the blame for any experimental failures. A theory fails when a more powerful rival emerges, not because it is falsified; strict falsification is impossible precisely because of those auxiliary hypotheses.
This seems to be the argument that Feyerabend’s Against Method specifically tried to refute, but it has been too long since I read it last.
The paper goes on to propose the “good-enough principle”: have scientists determine in advance what experimental outcome would be “good enough” to accord with their theory. This seems like a tentative step toward preregistration and more definite predictions, which I like, but the paper then makes an odd statistical proposal: test the null |\mu - \mu_0| \geq \Delta, where \Delta is a “good-enough value”. That is, suppose our theory predicts \mu_0; we know we will never obtain exactly this value, so we specify a range of experimental error we are willing to accept, and when we reject the null, we obtain support for our theory.
This still poses the problem that psychological theories are incapable of conjecturing \mu_0, let alone \Delta. But it does turn the test around so that the theory’s own point prediction is what gets put at risk: data landing far outside the good-enough belt count against the theory, which is much closer to the falsificationist logic of hypothesis testing run the right way round. This isn’t explicitly acknowledged in the paper, which seems a bit confused on the roles of null and alternative here, so I don’t know if this was intentional or well thought out. Either way, I don’t think it resolves the problem of insufficient theoretical rigor.
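For concreteness, here is how the range null |\mu - \mu_0| \geq \Delta could be tested in practice via two one-sided t-tests, the standard equivalence-testing recipe. This is my sketch, not a procedure spelled out in the paper, and the numbers (\mu_0 = 100, \Delta = 5) are invented.

```python
# Sketch of testing H0: |mu - mu_0| >= Delta with two one-sided t-tests.
# Illustration of the kind of test the proposal implies, not code from the
# paper; mu_0 = 100 and Delta = 5 below are made-up numbers.
import numpy as np
from scipy import stats

def good_enough_test(sample, mu_0, delta):
    """Return a p-value for H0: |mu - mu_0| >= delta (reject when small)."""
    n = len(sample)
    mean = np.mean(sample)
    se = np.std(sample, ddof=1) / np.sqrt(n)
    # One side of the null: mu <= mu_0 - delta
    p_low = stats.t.sf((mean - (mu_0 - delta)) / se, df=n - 1)
    # Other side of the null: mu >= mu_0 + delta
    p_high = stats.t.cdf((mean - (mu_0 + delta)) / se, df=n - 1)
    return max(p_low, p_high)   # both one-sided tests must reject

rng = np.random.default_rng(1)
sample = rng.normal(loc=101.0, scale=10.0, size=200)   # data near the predicted 100
print(good_enough_test(sample, mu_0=100.0, delta=5.0))  # small p: "good enough"
```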
Meehl’s 1990 article above tries to reply to this one.
Dar, R. (1987). Another look at Meehl, Lakatos, and the scientific practices of psychologists. American Psychologist, 42(2), 145–151. doi:10.1037/0003-066X.42.2.145
A retort to Serlin and Lapsley. Contains several points:
Yes, perhaps scientists aren’t strictly falsificationist: if a theory in physics does very well but misses the mark slightly in an experiment, we’d still be happy. But if the same happens for a theory in psychology, it should be in bad shape – the test in psychology is much less stringent, so failure is proportionally worse.
This seems to miss the issue of inadequate statistical power: most tests in psychology can be expected to fail simply because of small sample sizes (assuming the researchers don’t indulge in HARKing).
Yes, theories have a “protective belt” of auxiliary hypotheses. But in physics these tend to be widely accepted – the working of a measurement apparatus, for example, is easily understood and accepted by other physicists. The same does not hold for interventions and constructs in psychology: regardless of an experiment’s results, we can probably devise numerous alternative hypotheses to explain them.
“How can a theory be compelling if it is not, at least in some respects, unique in its ability to account for some experimental results? When it is so easy to come up with competing explanations for successful predictions and with ad hoc explanations for failing ones, why should the theory be taken seriously?”
If researchers built better theories and tested them more rigorously, a lot of the statistical machinery would not be necessary.