Theory-testing in psychology and physics
Paul E. Meehl · 1967 · Philosophy of Science 34(2), 103–115
In physics, sharper instruments make theories harder to corroborate. In psychology, more data makes the null easier to reject. Identical method, opposite direction. The field that moves faster is the one where success is harder to fake.
The paradox
Meehl's observation is short and brutal. In physics, a theory specifies a point value with a tolerance. The experimenter improves the measurement. The tolerance shrinks. Now the theory has to pass through a narrower gate to count as corroborated. More data is a tougher test. Success means more.
In soft psychology (his examples are personality, clinical, social) a theory specifies a direction. "Group A will score higher than group B on this scale." The experimenter increases the sample size. The standard error shrinks. Any true correlation, however small, eventually reaches significance. As N grows and power approaches one, the probability of a significant result in the predicted direction approaches one half even if the theory is worthless, and often runs higher. More data is an easier test. Success means less.
Same machinery (null hypothesis, p-value, rejection region) applied in two fields. In one, the method corroborates severely. In the other, it corroborates by ritual.
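A minimal numerical sketch of the two directions. The numbers are invented for illustration (a crud-level standardized difference of 0.05 on the psychology side, a point prediction that is off by a small amount on the physics side), with normal approximations throughout:

```python
# Illustrative sketch, not from the paper.  Effect sizes, tolerances, and
# error levels are invented; normal approximations throughout.
import numpy as np
from scipy.stats import norm

alpha = 0.05
z_crit = norm.ppf(1 - alpha)  # one-sided critical value, ~1.645

# Psychology: directional prediction, crud-level true effect d = 0.05.
# P(significant in the predicted direction) for two groups of n each.
d = 0.05
for n in [100, 1_000, 10_000, 100_000]:
    p_corroborate = norm.cdf(d * np.sqrt(n / 2) - z_crit)
    print(f"psychology  n={n:>7}: P(corroborate) = {p_corroborate:.2f}")

# Physics: point prediction mu0, true value actually off by delta = 0.5.
# The theory "passes" if the measurement lands within 2 sigma of mu0,
# so shrinking the instrument error sigma shrinks the pass probability.
delta = 0.5
for sigma in [5.0, 1.0, 0.5, 0.1]:
    p_pass = norm.cdf(2 - delta / sigma) - norm.cdf(-2 - delta / sigma)
    print(f"physics  sigma={sigma:>4}: P(corroborate) = {p_pass:.2f}")
```

More subjects push the first probability toward one; better instruments push the second toward zero.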
The crud factor
The mechanism is not subtle. In any large dataset of social-science variables, almost everything correlates with almost everything. Meehl reports Minnesota data on 55,000 high-school students measured on 45 miscellaneous variables: most pairwise associations were statistically significant, often at tiny p-values. Sex, birth order, religious affiliation, family size, room at home, club membership. All intercorrelated, because the true correlations are never exactly zero and a sample that large detects every one of them.
He calls this the crud factor. The nil null hypothesis, the claim that two variables are exactly uncorrelated, is essentially never true in observational soft science. Rejecting it tells you almost nothing about any particular theory, only that the variables are not both random noise. That is not a finding.
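The arithmetic behind the crud factor, with a hypothetical correlation (the r value is invented; only the sample size comes from the summary above): a substantively negligible correlation becomes wildly "significant."

```python
# Hypothetical crud-factor arithmetic: r = 0.03 is invented, N matches the
# sample size quoted above.
import numpy as np
from scipy import stats

n, r = 55_000, 0.03
t = r * np.sqrt((n - 2) / (1 - r**2))   # t statistic for a Pearson correlation
p = 2 * stats.t.sf(abs(t), df=n - 2)    # two-sided p-value
print(f"r = {r}, N = {n}: t = {t:.1f}, p = {p:.1e}")  # p on the order of 1e-12
```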
What distinguishes a real corroboration from crud is how tight the prediction was before the data came in. A theory that specifies "the effect is positive, greater than 0.20, in this subpopulation, under this manipulation" makes a risky claim. A theory that specifies "there is some positive effect somewhere" makes a claim the crud factor guarantees. In a large enough sample both come out nominally "significant." Only the first one was tested.
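Same arithmetic, two claims. The observed correlation here is invented; the contrast is between a confidence interval that merely excludes zero and one that would have to clear 0.20:

```python
# Invented numbers again: an observed r = 0.06 at N = 55,000.  The loose
# claim ("some positive effect") passes; the risky claim ("r exceeds 0.20")
# fails, because the confidence interval sits nowhere near 0.20.
import numpy as np

n, r_obs = 55_000, 0.06
z = np.arctanh(r_obs)            # Fisher z transform of r
se = 1 / np.sqrt(n - 3)
lo, hi = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)
print(f"95% CI for r: [{lo:.3f}, {hi:.3f}]")
print("non-risky claim (r > 0):   ", "pass" if lo > 0 else "fail")
print("risky claim    (r > 0.20): ", "pass" if lo > 0.20 else "fail")
```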
Why it matters
Meehl is writing in 1967, before the replication crisis had a name. The paper is prescient. He is explaining why soft psychology's literature will be full of successful tests that do not compound into cumulative theory. His diagnosis is not that researchers cheat but that their method cannot tell corroboration from crud without theoretical precision, which is exactly what soft psychology does not produce.
Later reform moves (preregistration, open data, replication) address symptoms. They do not fix the upstream problem: verbal theories over loosely defined constructs do not make risky predictions, so no statistical apparatus can do the corroboration work it is asked to do. The discipline is not statistical but epistemological: what does the theory commit you to?
Meehl returned to this theme repeatedly. His 1978 "Theoretical risks and tabular asterisks" (Journal of Consulting and Clinical Psychology) is the harder-hitting follow-up: significance tests in soft psychology are "a potent but sterile intellectual rake." His 1997 "The problem is epistemology, not statistics" is the title as the argument: no Bayesian, frequentist, or preregistration fix reaches the construct-theory gap.
Risky vs. non-risky predictions
The useful vocabulary Meehl leaves behind is the distinction between point and directional predictions, and between risky and non-risky tests. Physics usually tests point predictions: the anomalous precession of Mercury's perihelion is 43 arcseconds per century; general relativity predicts 43. Any other number falsifies. Soft psychology typically tests directional predictions: treatment group higher than control. Half the real line falsifies; the other half corroborates. Under any prior belief that the manipulation does something, the test is non-risky.
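A Monte Carlo sketch of what "non-risky" means, with invented distributions: true effects are crud-sized with a random sign, so the theory's predicted direction is right only by coin flip, yet the directional test passes about half the time while a prediction that commits to a number rarely does.

```python
# Monte Carlo sketch, all distributions invented: true effects are crud-sized
# with a random sign, so the theory's predicted direction is right only by
# coin flip.  With N = 100,000 per study, the directional test still "passes"
# about half the time; a point prediction (d = 0.05 +/- 0.01) rarely does.
import numpy as np

rng = np.random.default_rng(0)
n_sims, n = 20_000, 100_000
se = np.sqrt(2 / n)                                  # SE of a standardized difference

true_d = rng.choice([-1, 1], n_sims) * rng.uniform(0.01, 0.10, n_sims)
obs_d = true_d + rng.normal(0.0, se, n_sims)         # observed difference per study

directional_pass = (obs_d / se) > 1.645              # one-sided "significant"
point_pass = np.abs(obs_d - 0.05) < 0.01             # theory committed to a number

print(f"directional prediction passes: {directional_pass.mean():.2f}")  # ~0.5
print(f"point prediction passes:       {point_pass.mean():.2f}")        # much lower
```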
Meehl's suggested remedy is to make predictions risky: narrow intervals, point values where possible, multi-parameter joint predictions where not. Unpopular, because it requires theories that commit to numbers, and most verbal theories in soft psychology do not. His deeper point is that if your theory does not commit to a number, you should not be running significance tests on it; the test will always pass.
Connection to the rest
🔬 Popper said risky predictions are what separate science from pseudoscience. Meehl is Popper applied to psychology. Where Popper frames risk in terms of logical content, Meehl frames it in statistical-power terms: a directional prediction with high N has low risk because the prior P(success) is ~½ even if the theory is wrong.
🔬 Platt recommended strong inference: design experiments that discriminate between alternatives. Meehl explains why strong inference is harder in soft psychology: the alternatives often all predict the same direction, so no experiment discriminates.
🔬 Ioannidis gave the Bayesian accounting of the replication crisis. Meehl gave the structural explanation thirty-eight years earlier. The 96% positive-result rate Sterling and Fanelli documented is what Meehl's paradox predicts: under directional prediction with high N, ~100% of tests succeed, because directional prediction is almost always non-risky.
🔬 Mayo formalized the severity requirement: a test only corroborates if it had a real chance of catching the hypothesis being wrong. Meehl's methodological paradox is the severity requirement applied to a field that systematically violates it. Mayo gives the criterion; Meehl shows the mechanism.
What he did not say
Meehl did not argue for abandoning psychology. He argued for changing what counts as a test. He kept doing psychology, built out the construct validity machinery with Cronbach (1955), and took the epistemological gap between theory and measurement seriously rather than statistically. The reform proposal embedded in the 1967 paper is not "ban p-values." It is "make the theory commit to a number before you compute one."