Science on Trial

When you ask scientists about their work, they cite their positive results. When you ask the public what makes science credible, they cite peer review. When you ask ChatGPT for its sources, it points to publications as if publication were truth. If that all sounds fine, your notion of science is broken.

Science is supposed to be the act of changing beliefs in response to evidence. Every claim stands trial forever. Science is also, now, a noun. A body of credentialed outputs. An institution. The institution was built to support the activity, and has become more rewarding to maintain than the activity itself.

Do you “trust the science”?

The Scientific Process

Four centuries of arguing about what counts as knowledge produced a protocol. Each step exists because someone proved it was necessary.

Each step produces a public artifact: a registered prediction, an alternatives list, a timestamped log, a complete dataset. The trail is what counts as knowledge.

The Scientific “Process”

The protocol and the institution’s version of it are routinely conflated. The exhibits below turn on keeping them apart.

The institution trains the reader to recognize publication as truth. Is it?

Now the exhibits.

Exhibit A. Missing Pre-registration

ClinicalTrials.gov came online in 2000, and prospective registration entered the norm for NHLBI-funded cardiovascular trials around the same time. Kaplan and Irvin (2015) compared 55 large NHLBI trials before and after. Pre-2000: 17 of 30 (57%) reported positive primary outcomes. Post-2000: 2 of 25 (8%). Same question, same funder, chain of custody added.
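As a rough check on the counts quoted above, the sketch below recomputes the two proportions and runs a one-sided Fisher exact test using only the standard library. The choice of test and the helper name `fisher_one_sided` are assumptions made for illustration; this is not Kaplan and Irvin's own analysis.

```python
from math import comb

def fisher_one_sided(pos_a, n_a, pos_b, n_b):
    """P(group A has >= pos_a positives) under a hypergeometric null
    in which the positives are spread at random across both groups."""
    total = n_a + n_b
    positives = pos_a + pos_b
    upper = min(n_a, positives)
    tail = sum(
        comb(positives, k) * comb(total - positives, n_a - k)
        for k in range(pos_a, upper + 1)
    )
    return tail / comb(total, n_a)

pre_pos, pre_n = 17, 30    # pre-2000 trials with a positive primary outcome
post_pos, post_n = 2, 25   # post-2000 trials with a positive primary outcome

pre_rate = pre_pos / pre_n
post_rate = post_pos / post_n
drop_points = (pre_rate - post_rate) * 100
p_value = fisher_one_sided(pre_pos, pre_n, post_pos, post_n)

print(f"pre-2000:  {pre_rate:.0%}")
print(f"post-2000: {post_rate:.0%}")
print(f"drop: {drop_points:.0f} percentage points")
print(f"one-sided Fisher exact p = {p_value:.2g}")
```

The test only asks whether a gap this large is plausible by chance if registration made no difference. It says nothing about why the gap exists.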

Of the 25 post-2000 trials, 12 still reported significant secondary outcomes. Without the mandate, 48% would have told a positive story anyway.

Scheel, Schijen, and Lakens (2021) found the same pathology when they compared standard psychology reports against Registered Reports in the same field. Standard papers: 96% positive. Registered Reports: 44%. The baseline has been this high since Sterling reported 97% in 1959.

In that slice of high-stakes cardiovascular RCTs, the apparent success rate collapsed as soon as outcome declaration became visible. Those findings shaped treatment decisions for millions. Hormone replacement therapy (HRT) was prescribed to tens of millions of women on the strength of observational signals. The Women’s Health Initiative RCT (2002) was stopped early when its data safety monitoring board found combination HRT raised breast cancer and stroke risk. Prescriptions for combination HRT fell roughly two-thirds within a year.

Exhibit B. No Adversary

Under honest methodology, the published positive-result rate is capped by arithmetic. At α = 0.05 and 80% statistical power:

| If the tested hypotheses are… | Expectation | Reported (Scheel 2021, psychology) |
| --- | --- | --- |
| 20% true | 20% positive | 96% positive |
| 50% true | 43% positive | 96% positive |
| 80% true | 65% positive | 96% positive |
| 100% true (impossible ceiling) | 80% positive | 96% positive |
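The expectation figures above follow from one line of arithmetic: true hypotheses come out positive at the power rate, false ones at the α rate. A minimal sketch, assuming every test is reported and no analysis is adjusted after the fact (the function name `expected_positive_rate` is mine, not from any cited paper):

```python
ALPHA = 0.05   # false-positive rate when the hypothesis is false
POWER = 0.80   # true-positive rate when the hypothesis is true

def expected_positive_rate(share_true, alpha=ALPHA, power=POWER):
    """Share of tests expected to come out positive if every result is reported."""
    return power * share_true + alpha * (1.0 - share_true)

for share_true in (0.2, 0.5, 0.8, 1.0):
    rate = expected_positive_rate(share_true)
    print(f"{share_true:.0%} true -> {rate:.1%} positive expected")
```

The 50% row works out to 42.5%, which the table rounds to 43%. No value of `share_true` between 0 and 1 reaches 96% at these settings.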

Even the most generous assumption, that every tested hypothesis is true, cannot explain a 96% rate under honest methodology. The literature is not a discovery stream. There are three ways to produce the gap:

Burying nulls is fraud by omission. HARKing is fraud by commission. And the third requires no fraud: it’s career-preserving risk aversion, which is what a rational researcher does under publish-or-perish. Whichever is at work, all three produce the same rotten literature.

Sterling (1959) found 97.3% positive in four psychology journals. The same journals, thirty-six years later: 93–99%. Across twenty disciplines, Fanelli (2010) measured economics at 88.5%, psychology and psychiatry at 91.5%. Between 1990 and 2007, the rate grew another 22% (Fanelli 2012). Ioannidis (2005) gave the Bayesian version: given typical power and pre-study odds, the positive predictive value of a published biomedical finding is below 50%.
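To see how the positive predictive value moves with power and pre-study odds, here is a minimal sketch of the no-bias form of the calculation, PPV = power·R / (power·R + α), where R is the ratio of true to false relationships among those tested. The scenario values below are hypothetical inputs chosen for illustration, not figures taken from Ioannidis (2005).

```python
def ppv(power, pre_study_odds, alpha=0.05):
    """Positive predictive value of a significant finding, ignoring bias.

    pre_study_odds is R: the ratio of true to false relationships
    among those tested in a field.
    """
    true_positives = power * pre_study_odds
    false_positives = alpha
    return true_positives / (true_positives + false_positives)

# Hypothetical scenarios, chosen only to show how the value moves.
scenarios = [
    ("well-powered, even odds", 0.80, 1.0),
    ("well-powered, long odds", 0.80, 0.1),
    ("underpowered, long odds", 0.20, 0.1),
]
for label, power, odds in scenarios:
    print(f"{label}: PPV = {ppv(power, odds):.2f}")
```

Adding a bias term, as the paper's fuller version does, can only push these values lower.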

The system does not require red-teaming before publication. Everybody seems to be writing the positive papers that keep the grant funding flowing.

If peer review was supposed to be adversarial, where’s the accountability for the reviewers?

Exhibit C. Work log

Yoshitaka Fujii fabricated data across 183 papers over nineteen years of peer review in anesthesiology. John Carlisle, a UK anaesthetist, ran a distributional goodness-of-fit test on Fujii’s published baseline statistics and reported the result in Anaesthesia (2012). The journals had no trail to audit.

What are the chances the caught fraudsters are the only ones frauding? A trial without a trail isn’t a trial at all.

If there was evidence to support their claims, how did it suddenly disappear?

Exhibit D. The citation graph doesn’t update

Serra-Garcia and Gneezy (2021) tracked citations in psychology and economics after replication failure. Papers that failed to replicate were cited more than papers that replicated, and only 12% of post-failure citations mentioned the failure.

Hsiao and Schneider (2021) ran the same question across biomedicine. Of 13,252 citation contexts referencing 7,813 retracted PubMed papers, 5.4% acknowledged the retraction.

The trail isn’t just incomplete. The reader has no reliable way to tell whether the citation they just followed is still standing. Retraction is supposed to propagate, but the citation graph never carries it.

That matters because the reader is often a textbook, a policy brief, a clinical guideline, or an LLM. Claims enter downstream corpora and compound. When the original retracts, the downstream copies don’t. The honesty-pledge result kept training students and shaping forms for years after its data was shown to be fabricated.

The view from the inside

I once watched The Act of Killing, a documentary in which perpetrators of the Indonesian mass killings cheerfully reenact their murders for the camera. From inside the culture that celebrated them, the evil was invisible. They posed, they laughed, they corrected each other’s technique.

The structure transfers even if the moral weight does not. A practice that looks pathological from outside is unremarkable from inside, because the local rituals make sense to the people performing them.

The invisibility runs to you, the LessWrong reader. Appeal to authority is a named fallacy. Did you ever believe someone because they studied at Harvard or Stanford? Did you find a paper more credible because it had many citations? If so, you’re a part of the problem. That reflex, unnoticed and routine, is the symptom under diagnosis.

Not every appeal to authority is a fallacy. For a layperson deferring to expert consensus, it is a reasonable division of epistemic labor; you can’t personally check every claim. But when one scientist cites another scientist’s credence in place of evidence, it becomes a fallacy.

The same invisibility produced eugenics. R. A. Fisher, patron saint of statistical inference and popularizer of the p-value, was an active eugenicist. Peer-reviewed journals published the work through the 1930s, and forced sterilizations continued into the 1970s. The institution now treats eugenics as an alien moral failure rather than as work once published, funded, taught, and credentialed from inside the house. Credentialed consensus, journals, and prestige launder ideology as knowledge. That is the mechanism.

PhD students aren’t villains for wanting to be credentialed, nor are advisors for teaching them who to schmooze with. Nor are labs for pumping up those publication numbers. The institution is internally coherent. A discipline that can’t discipline itself is a profession.

Who is the jury?

If the jury is other credentialed scientists, science is a priesthood with extra steps. The same people who wrote the papers gatekeep their temple.

The outside jury is already in session. Retraction Watch logged 140 retractions in 2000 and more than 10,000 in 2023. Elisabeth Bik has surfaced image manipulation in more than 4,000 papers. Data Colada’s forensic analyses took down Ariely’s 2012 honesty paper and four of Gino’s; Harvard revoked Gino’s tenure in May 2025, apparently the first such revocation since at least the 1940s.

In March 2020, Bik publicly criticized Didier Raoult’s hydroxychloroquine paper, flagging non-randomized controls and six omitted treated patients, including one who died. In April 2021, Raoult’s IHU in Marseille filed a criminal complaint against her for harcèlement moral aggravé, tentative de chantage, and tentative d’extorsion. The prosecutor reportedly closed it in March 2024. The paper was retracted in December 2024.

The jury should be anyone who can read the evidence trail.

It’s a club

Put in economic terms, the scientific community functions as a cartel. Peer review is the admissibility rule. Citation is the currency. Credentialing is the output the members license. When an outside auditor produces an inspectable claim (Bik, Wilmshurst, Data Colada), the club answers with criminal complaints, libel suits, and defamation claims. George Carlin said it better than the auditors: “It’s a big club, and you ain’t in it.”

You would think that the journal is a stream of discoveries, but if you look closer, it’s a trophy case for grant applications. What can’t be displayed isn’t submitted, and what is submitted is written for the trophies.

The labor structure confirms the shape. NIH predoctoral stipends work out to $13.84 an hour at forty hours. It’s below the minimum wage in New York, San Francisco, or Boston. Research hours are rarely forty. Postdocs at the NIH scale approach or cross the line at 60–70 hour weeks. 14% of doctorate recipients self-finance out of personal resources. Only 7–8% of 2024 research-doctorate recipients had an immediate tenure-track job lined up. The lotto is negative EV.
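The hourly figure shrinks quickly once real hours are counted. A back-of-envelope sketch, assuming a 52-week paid year and deriving the annual stipend from the $13.84 figure above rather than from an official table:

```python
HOURLY_AT_40 = 13.84    # from the text: predoctoral stipend per hour at 40 h/week
WEEKS_PER_YEAR = 52     # assumption: paid year-round, no unpaid weeks

annual_stipend = HOURLY_AT_40 * 40 * WEEKS_PER_YEAR

def effective_hourly(hours_per_week, annual=annual_stipend):
    """Hourly pay implied by a fixed annual stipend at a given weekly workload."""
    return annual / (hours_per_week * WEEKS_PER_YEAR)

for hours in (40, 50, 60, 70):
    print(f"{hours} h/week -> ${effective_hourly(hours):.2f}/hour")
```

The stipend is fixed, so every hour past forty is unpaid by construction.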

The price of the ticket also rises. Academic postdocs grew from 12,500 in 1979 to 70,000 in 2024; multi-year postdoc chains are standard, and assistant-professor candidates arrive at hiring with materially more publications than their predecessors. The cartel’s temporal shape is a pyramid. Early entrants already cashed in. Later cohorts are inflating credentials against diminishing real positions, and each new generation subsidizes the last with cheaper labor and heavier dues.

Despite this, good work still gets done inside the club, by people paying personal cost to do it right.

“But Gino and Ariely were outliers”

Three of the four exhibits above describe honest work. Exhibit A, where 49 percentage points of positive cardiovascular findings evaporated under pre-registration, involves no fraud. Exhibit B, the 96% positive-result rate persisting since 1959, requires no fabrication. P-hacking, HARKing, and selective outcome reporting are all within the rules. Exhibit D, citations that don’t update when papers retract, is infrastructure.

Fraud is the one failure mode the institution acknowledges. The other three are what happens when the protocol is optional.

The 96% rate predates Ariely’s first paper by decades. When Sterling documented 97% positive in 1959, Gino was not yet alive.

“But science is self-correcting”

If self-correction means publishing a retraction, yes. If it means the update propagating through the literature, no.

Hsiao and Schneider found that 5.4% of citations to retracted biomedical papers acknowledge the retraction. Serra-Garcia and Gneezy found that papers that failed to replicate are cited more than papers that did. Retractions are filed, not enforced.

A self-correcting system requires that the correction propagate faster than the original claim. The citation graph runs the other way.

The record also fails to carry the corrections the field already knows about. Replication failure is not a retraction criterion, so retractions alone won’t match replication rates. But the literature should carry visible caution markers (failed-replication notices, effect-size revisions, boundary-condition warnings) on papers whose claims the field has stopped believing. It doesn’t. A paper that stopped replicating a decade ago is still cited naked, indistinguishable from one that held up.

“But failed replication doesn’t mean the original was false”

A result that does not transport has not earned the general claim built on top of it. Pearl’s point: transportability requires explicit assumptions about causal structure and context, assumptions usually unstated and untested. If your finding holds only for college sophomores on a Tuesday in 1998, you didn’t discover a law of human nature. You recorded an anecdote. Publishing it as a law and citing it as such is the error.

“But pre-registration isn’t appropriate for exploratory science”

It is, and especially then. A declared goal shapes the exploration. Science without a goal is philosophy. Registering the question is the courage to be publicly wrong, and it coordinates other scientists who can shake maximal surprise out of the data instead of settling for a 95%-confidence prior. And since when is p = 0.05 an arbiter of truth?

“But truthseeking happens despite the institution”

Yes. That is the argument. Real discovery is already routing around the credentialing mechanism, inside universities as much as outside them.

Science is happening, outside the sanctioned peer-reviewed journal process.

“But the outside jury can become a mob”

Yes, outside scrutiny can become performative or punitive. Online auditors can be wrong; reputational damage can run ahead of adjudication; selective outrage is real. That is why the standard has to be verifiable claims, not vibes. The checker checks, and the checker is checkable in return. The inside, by contrast, has been publishing mob verdicts, credentialed and peer-reviewed and catastrophically wrong, regardless. Eugenics, recovered memory, cold fusion, ego depletion.

Peter Wilmshurst, co-principal investigator of the MIST cardiac-device trial, publicly challenged the trial’s handling at a 2007 conference and refused to sign off on the Circulation paper. NMT Medical, the sponsor, sued him for libel in English court. In December 2010 the High Court ordered NMT to post £200,000 in security for costs; NMT went into liquidation in April 2011 and the case died with it.

A noisy but auditable process is already more accountable than the trust-me-bro peers.

“But without peer review, how am I supposed to trust anybody?”

Credentialed consensus is how geocentrism survived. For centuries, Europe’s credentialed natural philosophers inherited a geocentric cosmology and defended it against heresy. The consensus broke not because of better review but because Kepler had data and Galileo refused to defer.

The standardized, journal-centered, anonymous external referee process now treated as the source of scientific trust is largely a twentieth-century consolidation. Darwin, Einstein, Newton, Maxwell, Mendel, Watson and Crick: none of their canonical work passed through today’s referee regime. Was science untrustworthy until the mid-twentieth century?

The objection itself is an appeal to authority. “How do I trust anybody” asks for a credential to defer to. That’s the lazy way out. Instead, verify. When verification was expensive, deferral was just what we happened to do. Now that verification and data analysis are 100x cheaper with AI, the excuses are running out.

To claim truth in science, implement:

Peer review catches errors before publication. Useful, not dispositive, insufficient. The institution lost the thread when it started treating it as a certificate of truth.

“But it’s the best we can do”

The argument: reviewers have finite bandwidth, editors have deadlines, and scientists cannot be expected to document every false start while producing new work.

Every other industry that does serious work refutes this. Software engineers run version control, code reviews, and postmortems. Aviation maintains incident logs and black boxes. Pharmaceutical labs keep rigorous notebooks because patent law requires it. Finance runs on audit trails. Medicine runs on surgical checklists and M&M conferences. None publish externally. All maintain internal trails that survive the individual’s departure.

Publishing the protocol does not mean announcing results to the world. It means committing to a trail the next person in your role can audit. The scope of that trail is whatever IP, regulation, or competitive position allows: private, classified, internal. The discipline is not optional.

My own exhibit

I have run the protocol on myself. In a preregistered experiment, the final posterior came in at 0.949, a thousandth below the 0.95 threshold I had committed to beforehand. By my own rule, the result does not confirm. I said so. Reporting a null I had every incentive to round up was harder than any positive result I have published.
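The rule described above can be sketched in a few lines, assuming it is a plain comparison of the unrounded posterior against the committed threshold (the function name `confirms` is hypothetical):

```python
THRESHOLD = 0.95  # committed to before seeing the data

def confirms(posterior, threshold=THRESHOLD):
    """Apply the pre-registered rule to the unrounded posterior."""
    return posterior >= threshold

posterior = 0.949

print("confirmed:", confirms(posterior))
# Rounding to two decimals before comparing would flip the verdict,
# which is exactly the temptation a pre-registered rule removes.
print("confirmed after rounding:", confirms(round(posterior, 2)))
```

Fixing both the threshold and the precision in advance leaves no room to decide after the fact which version of the number gets reported.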

Your turn

If you read science: ask whether you can reproduce what you cite. Reject credence on credence alone; demand the evidence, the methodology, the thought process that led to the conclusion. If you write science: pre-register, publish the nulls, keep the log, release the data, write the replication. Run every draft through an LLM adversary first: every citation, every claim, every inference. If the LLM can find the hole, the replicator will. If you fund or hire: use trail quality as the metric. Mandate data sharing. Require Registered Reports for confirmatory work.

The reforms exist. Will the incentives change in pursuit of truth?


Written via the double loop.