Volley

Part of the methodology series. Builds on Blind, Blind, Merge.

[Image: Virtu-Volley, two low-poly robots in beach volleyball attire, one spiking and one diving, Virtua Fighter arcade aesthetic]

One prompt to Claude Code with codex CLI:

> Problem description is in PROBLEM.md, code is in src/. Volley with codex to a spec, then blind-blind-synthesize into src/, then volley the synthesis with codex again to a draft PR. Ambiguity heuristic: no regressions, UX improvement.

The problem description does the heavy disambiguation. Half the document describes the problem; the solution is mentioned briefly; the rest explains why it matters. The ambiguity heuristic handles design decisions (“no regressions, UX improvement”). The problem description handles why — what the human considers quality, what tradeoffs to make, what to optimize for. A checklist extracted from the description performs worse than the description itself. The why is what aligns the model to the human’s intent.

You still pay attention at every checkpoint. The difference is that the wait between them has collapsed from hours to minutes.

This works when the output is verifiable: bug fixes, API implementations, data pipelines, spec-driven features. Tests can check the contract. It does not work for subjective outputs (UI design, creative writing, procedural art) where “is this good?” requires a human.

Result from a recent cast: a gemini-cli bug fix across 8 files, 138 lines of tests, zero revisions after merge. Three rounds. Four hours. On a codebase I didn’t write.

The problem before merge

Blind, blind, merge works when the input is sharp. But the method amplifies whatever you give it. Sharp spec → sharp code. Vague spec → two different kinds of wrong. Ambiguity in requirements compounds downstream: ambiguous pronouns in specs caused wrong associations in 44.6% of resulting models.

The original experiment hand-wrote the spec over two days. Twenty revisions. Fifteen design decisions. Fred Brooks called this out decades ago: “The hardest single part of building a software system is deciding precisely what to build. No other part of the work so cripples the resulting system if done wrong.” Where does a sharp spec come from?

Back and forth

Full workflow: goal → volley 1 (spec) → blind-blind-merge (implementation, roughly half the bugs) → volley 2 (clean PR) → ship. Human checkpoints at each gate.

State a goal. Two models bounce it between them until neither can improve it.

Claude drafts. Codex challenges. “What does ‘improve ranking’ mean?” “What metric?” “What’s the baseline?” “What assumptions are you making?” Claude adjusts. Codex challenges again. The volley ends when the ball stops moving. You stated the goal. The models did the sharpening.

The output is whatever survives the exchange: a document precise enough that implementation is mechanical and verification is a test suite. Each round replaces a vague claim with a testable one. The volley converges when every claim is either testable or explicitly marked as an assumption.
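The loop above can be sketched in a few lines. This is a hypothetical interface, not the post's actual tooling: `drafter` and `challenger` stand in for the Claude Code and codex calls.

```python
def volley(draft, drafter, challenger, max_rounds=10):
    """Bounce an artifact between a drafter and a challenger until the
    challenger has nothing left to contest.

    `drafter(draft, challenges)` revises the draft against each challenge;
    `challenger(draft)` reads the draft cold and returns a list of
    objections. Both are stand-ins for model calls.
    """
    for round_no in range(1, max_rounds + 1):
        challenges = challenger(draft)       # fresh read, no history
        if not challenges:                   # the ball stopped moving
            return draft, round_no
        draft = drafter(draft, challenges)   # replace vague claims with testable ones
    return draft, max_rounds
```

The termination condition is the whole design: the loop has no fixed round count, only a challenger that eventually runs out of objections.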

Why it converges

The two sides have different memory. Claude Code carries persistent context: the full conversation, every revision, every decision, every reason a claim changed; codex gets fresh context every call. No memory of previous rounds. No anchoring to earlier drafts. It reads the artifact cold, as a stranger would.

Claude drifts toward coherence with its own history. Codex can’t drift because it has no history. If a claim only makes sense because of a conversation three rounds ago, codex will push back. Kahneman called this the outside view: “the inside-view forecasts are not even close.” The spec converges when it doesn’t need the conversation history to make sense.

The persistent side accumulates understanding; the fresh side verifies legibility. Familiarity breeds complacency: the more familiar you are with a text, the more errors you overlook. Fresh eyes don’t have that context. They see the artifact, not the journey. Peer review outperforms self-review even when self-reviewers make more revisions. The stranger catches what the author can’t see. The volley automates stranger-review on every round.
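The asymmetry is enforceable in code: the challenge call is built from the artifact alone, never from the conversation. A minimal sketch (the prompt wording is illustrative, not the post's actual prompt):

```python
def challenger_prompt(artifact: str) -> str:
    """Build the fresh-context challenge call.

    Deliberately takes only the artifact as input. There is no parameter
    for conversation history, so the challenger cannot anchor to earlier
    drafts: it reads the document as a stranger would.
    """
    return (
        "You have never seen this document before.\n"
        "List every claim that is untestable, ambiguous, or that only "
        "makes sense with context you were not given.\n\n"
        + artifact
    )
```

The design choice is in the signature, not the prompt: if history can't be passed in, it can't leak in.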

Fixed point in two rounds

First round: codex challenges five assumptions. Second round: two clarifications. Third round: stable. You don’t specify the number of rounds. The rounds terminate when neither model can improve the shot. Delphi studies show the same pattern: across 287 consensus-finding studies, 90% finished in two to three rounds. Google Research’s AI co-scientist (2025) uses it too: “agents use automated feedback to iteratively generate, evaluate, and refine hypotheses.”

Adversarial convergence, compared

| Method | Time to converge | Setup cost | Fresh eyes? |
| --- | --- | --- | --- |
| Delphi method | Weeks | Recruit panel, design questionnaire | Yes (anonymous experts) |
| Red team | Days | Staff a team, define scope, schedule | Yes (adversarial role) |
| Peer code review | Hours–days | Find a reviewer, wait for availability | Partial (reviewer has org context) |
| Self-review | Minutes | None | No (anchored to own decisions) |
| Volley | Minutes | None (one-liner) | Yes (codex reads cold every round) |

Same convergence pattern. Neither side needs to be recruited, scheduled, or convinced to participate.

Four hours, eight files

A gemini-cli bug: setHistory() was silently dropping conversation state during context compression. The goal: “fix state loss without regressions.”

Volley round 1: codex identified three separate mutation paths that could corrupt state: setHistory() in geminiChat.ts, abort signal threading in chatCompressionService.ts, and a race in clusterSummarizer.ts. I would have fixed only the first.

Volley round 2: I pushed back that the abort signal was unrelated. Codex showed the call chain: compression triggers setHistory(), which triggers a re-render, which fires during an active resolveDirty() call. The abort signal isn’t unrelated — it’s the mechanism that prevents the race.
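The mechanism codex described fits a general pattern: a long-running read races a mutation, and an abort signal is what keeps the stale read from being committed. A Python sketch of that pattern, with all names hypothetical stand-ins for the TypeScript code (`begin_resolve`/`finish_resolve` split `resolveDirty()` so the interleaving is visible; `AbortSignal` mimics an `AbortController` signal):

```python
class AbortSignal:
    """Minimal stand-in for an AbortController-style signal."""
    def __init__(self):
        self.aborted = False

    def abort(self):
        self.aborted = True


class Chat:
    def __init__(self):
        self.history = []
        self._active_signal = None  # signal of any in-flight reader

    def begin_resolve(self, signal):
        """First half of a long-running read: take a snapshot."""
        self._active_signal = signal
        return list(self.history)

    def finish_resolve(self, signal, snapshot):
        """Second half: discard the work if a mutation raced us."""
        self._active_signal = None
        return None if signal.aborted else snapshot

    def set_history(self, new_history):
        """Compression path: abort any in-flight reader before mutating,
        so a stale snapshot is never committed over the new state."""
        if self._active_signal is not None:
            self._active_signal.abort()
        self.history = new_history
```

Without the abort step, `finish_resolve` would return a snapshot taken before compression, silently reviving the dropped state.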

Round 3: stable. The resulting fix touched 8 files, added 138 lines of tests, and the PR passed gemini’s own review. Three rounds, four hours wall-clock, zero revisions after merge. On a codebase I didn’t write, with a bug that had three interacting causes I wouldn’t have found alone.

The double loop alone would have taken two to three days: one to understand the codebase, one to find all three mutation paths, one to get the tests right. The volley collapsed that into an afternoon because codex found all three paths in round one. I verified instead of discovered.

That was the learning round. Once practiced:

| Fix | Diff | Time | Verified by |
| --- | --- | --- | --- |
| Decode multi-byte UTF-8 in API errors | +111/−6, 2 files | ~15 min | Claude, codex, Gemini |
| Let vim consume escape during streaming | +43, 2 files | ~15 min | Claude, codex, Gemini |
| Synchronous stderr write before exit | +7/−3, 1 file | ~16 min | Claude, codex, Gemini |

Three PRs, forty-six minutes total. Issue to PR. All with tests.

Three gates

  1. Goal (gate: does this goal make sense?)
  2. Volley → sharp spec (gate: does this spec match my intent?)
  3. Blind, blind, merge → draft PR (gate: does the PR match the spec?)

Everything between gates is automated. Tests verify against the volleyed criteria. Your job is to check each gate’s output against your intent.
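The gate structure can be sketched as a pipeline where the human appears only as an approval callback. All four callables are hypothetical stand-ins, not a real API:

```python
def run_cast(goal, approve, volley_fn, blind_blind_merge):
    """Three-gate pipeline: everything between gates runs unattended;
    `approve(gate, artifact)` is the human checkpoint at each gate.
    """
    if not approve("goal", goal):      # gate 1: does this goal make sense?
        return None
    spec = volley_fn(goal)             # volley until the spec is sharp
    if not approve("spec", spec):      # gate 2: does it match my intent?
        return None
    pr = blind_blind_merge(spec)       # blind, blind, merge -> draft PR
    if not approve("pr", pr):          # gate 3: does the PR match the spec?
        return None
    return pr
```

A rejection at any gate stops the cast rather than letting an unapproved artifact flow downstream.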

The volley sharpens the spec. Blind-blind-merge halves the bugs because complementary mistakes cancel. By the time the PR reaches you, a human who knows the codebase can inspect it, or one who doesn’t can trust the tests.

When it breaks

Try it

Pick a bug; state the goal; run the cast; count the rounds; see what survives.


Written via the double loop.