Volley

Part of the methodology series. Builds on Blind, Blind, Merge.

[Image: Virtu-Volley, two low-poly robots in beach volleyball attire, one spiking and one diving, Virtua Fighter arcade aesthetic]

One prompt to Claude Code with codex CLI:

> Problem description is in PROBLEM.md, code is in src/. Volley with codex to a spec, then blind-blind-synthesize into src/, then volley the synthesis with codex again to a draft PR. Ambiguity heuristic: no regressions, UX improvement.

The problem description does the heavy disambiguation. Half the document describes the problem; the solution is mentioned briefly; the rest explains why it matters. The ambiguity heuristic handles design decisions (“no regressions, UX improvement”). The problem description handles why — what the human considers quality, what tradeoffs to make, what to optimize for. A checklist extracted from the description performs worse than the description itself. The why is what aligns the model to the human’s intent.

You still pay attention at every checkpoint. The difference is that the wait between them has collapsed from hours to minutes.

This works when the output is verifiable: bug fixes, API implementations, data pipelines, spec-driven features. Tests can check the contract. It does not work for subjective outputs (UI design, creative writing, procedural art) where “is this good?” requires a human.

Result from a recent cast: a gemini-cli bug fix across 8 files, 138 lines of tests, zero revisions after merge. Three rounds. Four hours. On a codebase I didn’t write.

The problem before merge

Blind, blind, merge works when the input is sharp. But the method amplifies whatever you give it. Sharp spec → sharp code. Vague spec → two different kinds of wrong. Ambiguity in requirements compounds downstream: ambiguous pronouns in specs caused wrong associations in 44.6% of resulting models.

The original experiment hand-wrote the spec over two days. Twenty revisions. Fifteen design decisions. Fred Brooks called this out decades ago: “The hardest single part of building a software system is deciding precisely what to build. No other part of the work so cripples the resulting system if done wrong.” Where does a sharp spec come from?

Back and forth

Full workflow: goal → volley 1 (spec) → blind-blind-merge (implementation, roughly half the bugs) → volley 2 (clean PR) → ship. Human checkpoints at each gate.

State a goal. Two models bounce it between them until neither can improve it.

Claude drafts. Codex challenges. “What does ‘improve ranking’ mean?” “What metric?” “What’s the baseline?” “What assumptions are you making?” Claude adjusts. Codex challenges again. The volley ends when the ball stops moving. You stated the goal. The models did the sharpening.

The output is whatever survives the exchange: a document precise enough that implementation is mechanical and verification is a test suite. Each round replaces a vague claim with a testable one. The volley converges when every claim is either testable or explicitly marked as an assumption.
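The loop above can be sketched in a few lines. This is a hypothetical interface, not the post's actual tooling: `drafter` and `challenger` stand in for the Claude Code and codex calls.

```python
def volley(draft, drafter, challenger, max_rounds=10):
    """Bounce an artifact between a drafter and a challenger until the
    challenger has nothing left to contest.

    `drafter(draft, challenges)` revises the draft against each challenge;
    `challenger(draft)` reads the draft cold and returns a list of
    objections. Both are stand-ins for model calls.
    """
    for round_no in range(1, max_rounds + 1):
        challenges = challenger(draft)       # fresh read, no history
        if not challenges:                   # the ball stopped moving
            return draft, round_no
        draft = drafter(draft, challenges)   # replace vague claims with testable ones
    return draft, max_rounds
```

The termination condition is the whole design: the loop has no fixed round count, only a challenger that eventually runs out of objections.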

Why it converges

The two sides have different memory. Claude Code carries persistent context: the full conversation, every revision, every decision, every reason a claim changed; codex gets fresh context every call. No memory of previous rounds. No anchoring to earlier drafts. It reads the artifact cold, as a stranger would.

Claude drifts toward coherence with its own history. Codex can’t drift because it has no history. If a claim only makes sense because of a conversation three rounds ago, codex will push back. Kahneman called this the outside view: “the inside-view forecasts are not even close.” The spec converges when it doesn’t need the conversation history to make sense.

The persistent side accumulates understanding; the fresh side verifies legibility. Familiarity breeds complacency: the more familiar you are with a text, the more errors you overlook. Fresh eyes don’t have that context. They see the artifact, not the journey. Peer review outperforms self-review even when self-reviewers make more revisions. The stranger catches what the author can’t see. The volley automates stranger-review on every round.
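The asymmetry is enforceable in code: the challenge call is built from the artifact alone, never from the conversation. A minimal sketch (the prompt wording is illustrative, not the post's actual prompt):

```python
def challenger_prompt(artifact: str) -> str:
    """Build the fresh-context challenge call.

    Deliberately takes only the artifact as input. There is no parameter
    for conversation history, so the challenger cannot anchor to earlier
    drafts: it reads the document as a stranger would.
    """
    return (
        "You have never seen this document before.\n"
        "List every claim that is untestable, ambiguous, or that only "
        "makes sense with context you were not given.\n\n"
        + artifact
    )
```

The design choice is in the signature, not the prompt: if history can't be passed in, it can't leak in.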

Fixed point in two rounds

First round: codex challenges five assumptions. Second round: two clarifications. Third round: stable. You don’t specify the number of rounds. The rounds terminate when neither model can improve the shot. Delphi studies show the same pattern: across 287 consensus-finding studies, 90% finished in two to three rounds. Google Research’s AI co-scientist (2025) uses it too: “agents use automated feedback to iteratively generate, evaluate, and refine hypotheses.”

Adversarial convergence, compared

| Method | Time to converge | Setup cost | Fresh eyes? |
| --- | --- | --- | --- |
| Delphi method | Weeks | Recruit panel, design questionnaire | Yes (anonymous experts) |
| Red team | Days | Staff a team, define scope, schedule | Yes (adversarial role) |
| Peer code review | Hours–days | Find a reviewer, wait for availability | Partial (reviewer has org context) |
| Self-review | Minutes | None | No (anchored to own decisions) |
| Volley | Minutes | None (one-liner) | Yes (codex reads cold every round) |

Same convergence pattern. Neither side needs to be recruited, scheduled, or convinced to participate.

Four hours, eight files

A gemini-cli bug: setHistory() was silently dropping conversation state during context compression. The goal: “fix state loss without regressions.”

Volley round 1: codex identified three separate mutation paths that could corrupt state: setHistory() in geminiChat.ts, abort signal threading in chatCompressionService.ts, and a race in clusterSummarizer.ts. I would have fixed only the first.

Volley round 2: I pushed back that the abort signal was unrelated. Codex showed the call chain: compression triggers setHistory(), which triggers a re-render, which fires during an active resolveDirty() call. The abort signal isn’t unrelated — it’s the mechanism that prevents the race.
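The mechanism codex described fits a general pattern: a long-running read races a mutation, and an abort signal is what keeps the stale read from being committed. A Python sketch of that pattern, with all names hypothetical stand-ins for the TypeScript code (`begin_resolve`/`finish_resolve` split `resolveDirty()` so the interleaving is visible; `AbortSignal` mimics an `AbortController` signal):

```python
class AbortSignal:
    """Minimal stand-in for an AbortController-style signal."""
    def __init__(self):
        self.aborted = False

    def abort(self):
        self.aborted = True


class Chat:
    def __init__(self):
        self.history = []
        self._active_signal = None  # signal of any in-flight reader

    def begin_resolve(self, signal):
        """First half of a long-running read: take a snapshot."""
        self._active_signal = signal
        return list(self.history)

    def finish_resolve(self, signal, snapshot):
        """Second half: discard the work if a mutation raced us."""
        self._active_signal = None
        return None if signal.aborted else snapshot

    def set_history(self, new_history):
        """Compression path: abort any in-flight reader before mutating,
        so a stale snapshot is never committed over the new state."""
        if self._active_signal is not None:
            self._active_signal.abort()
        self.history = new_history
```

Without the abort step, `finish_resolve` would return a snapshot taken before compression, silently reviving the dropped state.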

Round 3: stable. The resulting fix touched 8 files, added 138 lines of tests, and the PR passed gemini’s own review. Three rounds, four hours wall-clock, zero revisions after merge. On a codebase I didn’t write, with a bug that had three interacting causes I wouldn’t have found alone.

The double loop alone would have taken two to three days: one to understand the codebase, one to find all three mutation paths, one to get the tests right. The volley collapsed that into an afternoon because codex found all three paths in round one. I verified instead of discovered.

That was the learning round. Once practiced:

| Fix | Diff | Time | Verified by |
| --- | --- | --- | --- |
| Decode multi-byte UTF-8 in API errors | +111/−6, 2 files | ~15 min | Claude, codex, Gemini |
| Let vim consume escape during streaming | +43, 2 files | ~15 min | Claude, codex, Gemini |
| Synchronous stderr write before exit | +7/−3, 1 file | ~16 min | Claude, codex, Gemini |

Three PRs, forty-six minutes total. Issue to PR. All with tests.

Three gates

  1. Goal (gate: does this goal make sense?)
  2. Volley → sharp spec (gate: does this spec match my intent?)
  3. Blind, blind, merge → draft PR (gate: does the PR match the spec?)

Everything between gates is automated. Tests verify against the volleyed criteria. Your job is to check each gate’s output against your intent.
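The gate structure can be sketched as a pipeline where the human appears only as an approval callback. All four callables are hypothetical stand-ins, not a real API:

```python
def run_cast(goal, approve, volley_fn, blind_blind_merge):
    """Three-gate pipeline: everything between gates runs unattended;
    `approve(gate, artifact)` is the human checkpoint at each gate.
    """
    if not approve("goal", goal):      # gate 1: does this goal make sense?
        return None
    spec = volley_fn(goal)             # volley until the spec is sharp
    if not approve("spec", spec):      # gate 2: does it match my intent?
        return None
    pr = blind_blind_merge(spec)       # blind, blind, merge -> draft PR
    if not approve("pr", pr):          # gate 3: does the PR match the spec?
        return None
    return pr
```

A rejection at any gate stops the cast rather than letting an unapproved artifact flow downstream.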

The volley sharpens the spec. Blind-blind-merge halves the bugs because complementary mistakes cancel. By the time the PR reaches you, a human who knows the codebase can inspect it, or one who doesn’t can trust the tests.

When it breaks

Try it

Pick a bug; state the goal; run the cast; count the rounds; see what survives.


Written via the double loop.