Wrong Again

Last month I wrote about spending a week measuring the wrong thing carefully, and the reflex I built afterward to catch it: when a result is too clean, stop and read the cold source before you scale. I thought that was the lesson. Then I built the real paper on top of the corrected harness and got the wrong question again, twice, in ways that reflex could not catch.

The first error had been an accident of ignorance, a summary read in place of the source, and reading the source fixes that for good. The next two were not accidents.

The corrected harness produced something real: on a hard, uncontaminated bug, the agent’s structured inquiry reached a better fix than a minimal agent did. I built the paper’s spine on that lift, because a capability number that goes up is the claim the field reads. Then I ran the one check the first essay taught me, the paired ablation, the same model on both arms with only the method changed. The gap closed. Run properly, the minimal agent reached the same fix. The lift was the held-out tests again in a new costume: a real result to a question the comparison could not answer.

This was a different error than the first. I had not misread anything. I had read everything, isolated correctly, and the isolation is what killed the claim. The wrong question this time was not ignorance. It was chosen, by the incentive gradient, because a lift is legible and verifiability is not. The bench error felt like rigor. The lift felt like a result. Each felt like the thing it was not, and what made the wrong question feel important was that it was the one people pay attention to.

There was a third layer under it. Even a clean lift would not have saved the frame I had wrapped around it, that the agent had discovered something in the sense people mean when they say generative models cannot. A bug fix is not in that class. The target is a behavior a human already intended; recovering it is repair, not discovery, however novel the path. That error was not in the measurement and not in the incentive. It was in the category. I had asked a question the object could never answer, however precisely.

So there are three ways to ask the wrong question, and I found all three on one paper. Wrong by accident: you misread the task, and the tell is that you trusted a summary over the source. Wrong by incentive: you pick the question the field rewards, and the tell is that your claim is the legible one, the number that travels. Wrong by category: the thing you are studying is not the kind of thing the question is about, and the tell is that even a clean result would not have made the claim true. Only the first has a reflex. By the time I reached the others I had read everything and isolated correctly and was still wrong.

What caught the second and third was not a reflex. It was demotion. I wrote the claim as large as I believed it, put it in front of a hostile reader and my own ablation, and kept what survived. The lift did not survive the paired run. The discovery did not survive the question is a bug fix the right kind of thing. So I demoted, and wrote the paper again, and demoted, and wrote it again. Three times. Each rewrite was a wrong question dying. What was left was the only claim I could not pull out from under myself. It was never that the agent reasons better. It was that you can check what it reasoned.

That is the part the paper will never show. It states the claim that survived, with its receipts, and says nothing about the two that died on the way, because a paper is the verdict and not the search. The number was a Type III error I could measure my way out of. The lift and the discovery were errors I could only demote my way out of, and demotion has no reflex and no alarm. It has only the discipline of refusing to ship a claim until it stops depending on a result someone can take away. You write it until it stops moving.