Blind, Blind, Merge
Builds on The Spec Is a Hypothesis.
Spec, code, iterate. Everyone knows the third step is where the time goes. Ten bugs in code review. Twenty spec revisions. Fifteen design decisions before the LLM could even start. The previous post identified the boundary – prose carries architecture, code carries constraints. It stopped there.
Two frontier models exist. I ran both.
The experiment
Union-find context compaction for Gemini CLI. Three inputs for the implementer:
- The existing TypeScript source (ground truth)
- A prose transformation spec (the hypothesis)
- A working Python prototype (existence proof)
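The prototype itself isn't reproduced here, but the core mechanism can be sketched. Assuming messages are compared pairwise and merged into clusters when they're similar enough, a union-find structure with path compression does the bookkeeping. `similarity` and `MERGE_THRESHOLD` are illustrative stand-ins, not names from the actual prototype:

```python
# Hedged sketch of union-find context compaction. The real prototype
# (~300 lines of Python) is not shown; names here are hypothetical.

MERGE_THRESHOLD = 0.8  # illustrative; the real threshold lives in the prototype

def find(parent, i):
    # Path compression: point nodes at their grandparent while walking up.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def union(parent, a, b):
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra

def compact(messages, similarity):
    """Group messages into clusters; each cluster can later be summarized."""
    parent = list(range(len(messages)))
    for i in range(len(messages)):
        for j in range(i + 1, len(messages)):
            if similarity(messages[i], messages[j]) >= MERGE_THRESHOLD:
                union(parent, i, j)
    clusters = {}
    for i, msg in enumerate(messages):
        clusters.setdefault(find(parent, i), []).append(msg)
    return list(clusters.values())
```

Each resulting cluster is a candidate for one compacted summary; messages that never cross the threshold survive as singletons.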
First run: the original pipeline. Extract existing behavior into prose, design a transformation spec, hand it to an LLM. Working implementation, all tests passing. Then code review found ten bugs. Twenty spec revisions. Two days of back and forth.
Second and third runs: GPT-5.4 (Codex) and Claude Opus 4.6, separately. Neither saw the other’s output. Same three inputs, same one-sentence directive:
> No regressions, UX improvement. Existing behavior is correct unless the spec explicitly changes it. Code wins on invariants. Spec wins on the blocking path.
That’s the authority rule. It displaced fifteen design decisions with one sentence.
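Read as a decision procedure, the rule is a two-branch lookup. The conflict labels below are mine, added for illustration; the directive itself is just the sentence above:

```python
# Hedged sketch: the authority rule as a resolution function.
# The conflict categories are illustrative labels, not part of the directive.

def who_wins(conflict_kind: str) -> str:
    """Resolve a spec-vs-code disagreement under the authority rule."""
    if conflict_kind == "invariant":      # existing behavior, data shapes, contracts
        return "code"                     # code wins on invariants
    if conflict_kind == "blocking_path":  # a UX change the spec explicitly makes
        return "spec"                     # spec wins on the blocking path
    return "code"                         # default: existing behavior is correct
```

Everything the fifteen upfront decisions were asking about falls into one of those branches, which is why they collapsed.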
Fifteen decisions, one sentence
Before the authority rule, Codex wanted those decisions resolved upfront. Hot zone size. Model routing. Merge threshold. Timestamp format. Persistence strategy. Failure backoff. Hook contracts. Content serialization.
When pushed to be pragmatic – “which of these do you actually need from me?” – every one collapsed. Existing behavior is correct, so no regressions. New behavior must improve UX, so spec wins on the blocking path. Everything else: read the code, follow the rule.
Zero human decisions. Both implementers started coding.
Complementary failures
The two blind implementations made complementary mistakes.
| | Codex (GPT-5.4) | Claude Opus 4.6 |
|---|---|---|
| Got right | Persistent state, config-driven tuning, rich serialization, query-aware retrieval | Per-cluster error recovery, timestamp preservation, original Content objects for hot messages |
| Got wrong | Flattened all output to role: 'user', destroyed conversation structure | Broke incremental ingestion, wrong persistence key, dropped failure-handling params |
Different models fail on different sides of the same boundary. Neither was shippable. Both were necessary.
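Codex's worst bug is the kind a one-line invariant check would have caught: flattening every message to `role: 'user'` destroys conversation structure that the serializer must carry through. A hypothetical regression check (message shape illustrative, not the actual Gemini CLI types):

```python
# Hypothetical guard against the role-flattening bug: compaction must
# preserve each message's role, never collapse everything to "user".

def serialize(messages):
    # Correct version: carry the role through unchanged.
    return [{"role": m["role"], "content": m["content"]} for m in messages]

def roles_preserved(before, after):
    return [m["role"] for m in before] == [m["role"] for m in after]

history = [
    {"role": "user", "content": "compact my context"},
    {"role": "model", "content": "done"},
]
```

The broken version returned `"user"` for every entry, so `roles_preserved` would fail immediately. The spec never stated this invariant; the code did, which is exactly what "code wins on invariants" was for.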
The Method: Blind, Blind, Merge
The fourth implementation is the pushout: Codex’s architecture merged with Claude’s defensive patterns along their shared inputs. A fresh reviewer (blind to the history) ranked the four versions:
- Synthesis – closest to production
- Original pipeline – simplest, fewest structural mistakes
- Claude (blind) – good patterns, broken plumbing
- Codex (blind) – good architecture, broken integration
The surprise: the original pipeline ranked second. It never read the existing source, yet it had fewer structural mistakes than either blind implementation. Ignorance was protective. Ambition created more surface area for bugs. But the synthesis couldn’t exist without both.
The value lived in the loop between implementations.
What it cost
| | Spec, code, iterate | Blind, blind, merge |
|---|---|---|
| Wall clock | ~2 days | 1 hour |
| Human decisions | 15 | 0 |
| Spec revisions | 20 | 0 |
| Bugs remaining | 10 | 1 (ingestion drift) |
More LLM work. Far less human work. You’re paying for the subscriptions anyway.
The last mile
The bug class didn’t change across any implementation: serialization loss, concurrency guards, lifecycle management. Every implementation hit bugs at boundaries the spec didn’t define. The authority rule shrank the surface area. The synthesis loop caught more instances. But the failure mode is structural: prose can’t carry type contracts, and no amount of process eliminates that boundary.
The contradiction-finding pass was valuable for forcing deep reading before writing code. Every risk it catalogued, some implementation later hit. It knew where to look.
This methodology gets you closer to the finish line faster. The last mile is still code review by someone who understands both the architecture and the type system. Blind, blind, merge – then a human looks at what’s left.
Why does it work?
The prose spec carries the why. Theory is load-bearing: code tells you what the system does, never what it should do. Without the spec, the LLM guesses intent. It guesses wrong.
The prototype encodes decisions the prose left abstract: merge thresholds, similarity functions, lifecycle details. This could be a standalone implementation or an experiment. Ours was ~300 lines of Python.
The authority rule collapses ambiguity. One sentence that says what wins when the inputs conflict. Leave it out and the LLM asks you to resolve every decision upfront. Or worse, guesses.
And two models instead of one? AlphaCode showed that generating many candidates and filtering beats generating one good one. Self-consistency showed that sampling diverse reasoning paths and voting beats single-shot. Branch-Solve-Merge showed that parallel decomposition and fusion beats sequential. All of them sample from the same model with different random seeds. The diversity is stochastic.
Here, the diversity is structural. Different training, different blind spots. In the parts bin, this is an Attend operation with explicit redundancy control: like a portfolio solver that spawns threads with different seeds, except the threads are different models, and the output is a merge rather than a selection. You can’t get that from running the same model twice.
Written via the double loop. For more, see work log. Evidence: synthesis, Codex blind, Claude blind, prototype, spec.