Behavioral Reconstruction Without the Source

Why its primary metric measures prior information, not source-blind reconstruction.

An oracle who withholds all the questions, in the universe of all questions, is an oracle impossible to find.

Submitted to the ProgramBench authors for right of reply at [issue link]; to be archived at [Zenodo DOI]. Corrections will be incorporated and the offer stands.

Abstract

ProgramBench asks an agent to reconstruct a program from its compiled binary and documentation, with the source withheld and the binary set execute-only so it cannot be read or reverse-engineered, and grades the result on behavioral equivalence against a hidden suite of generated tests. The benchmark defends its fairness on the grounds that the binary is a queryable oracle, so any graded behavior is discoverable by running the original. We grant that the grading is determinate, and show that determinacy does not decide solvability. Because the binary is execute-only and the solver has no internet, its only operation is to run the reference and observe finitely many input-output pairs. Some graded behaviors are not recoverable from any such finite sample: the exact output of a function over an open input domain, a hash, a cipher, a codec, a compressor, where infinitely many functions fit every observation and diverge on the hidden test input. Reproducing these requires the algorithm’s specification, which the binary does not reveal and no internet supplies, so success on them measures recalled prior information rather than source-blind reconstruction. We confirm the pattern in the benchmark’s own hidden test bodies, and because the primary metric is conjunctive, one such test makes a whole program unresolvable by reconstruction. The consequence is a unit of recommendation for each audience: a list of programs a benchmark runner should exclude, and the test-level repair a benchmark maker can apply.

1. The claim

We defend one proposition. Because ProgramBench supplies the reference as an execute-only binary and denies the solver internet, a source-blind solver can only run the binary and read the bundled documentation; for some graded behaviors no such solver can reconstruct the answer, because it is the output of a function over an open input domain that no finite set of oracle queries identifies. The information that would replace the search is the algorithm’s specification, withheld and unfetchable, so the only feasible channel to those behaviors is prior information, and because the metric is conjunctive, one such behavior in a task’s suite puts the whole program beyond reconstruction.

The argument is an existence proof, and it reduces to a single witness: one graded behavior that no finite sample identifies and no supplied document supplies is enough to put a task beyond reconstruction. It binds every solver at once, turning on no model’s weakness and no scaffold’s limit, because what is missing is information, not capability. We use impossible in the feasibility sense, not the computability sense, and bound it precisely: infeasible with classical computing within the foreseeable future. Every graded behavior is computable; the reference computes it, and a reconstruction exists in principle. But computable is not feasible. Nineteen-by-nineteen Go is decidable, a perfect-play solution provably exists, yet it is unsolved because its search exceeds foreseeable classical hardware; simulating physical reality is computable and likewise out of reach. The reconstruction we deny is of that kind, not undecidable but unreachable by any classical search a solver could run in this hardware era. Existence is not reachability, and the escape from the search is not capability; it is holding the program’s specification in advance.

2. The benchmark, stated precisely

ProgramBench (Yang et al., 2026, arXiv:2605.03546; the paper hereafter) gives the agent a compiled binary B and usage documentation D. The source S that produced B is stripped and withheld, and B is set to execute-only permissions, a restriction the paper imposes, in its own words, to “prevent reading or reverse engineering of the binary with a tool like ghidra” (§2.2, Build an inference environment). A solver may run B and observe outputs; it may not read its bytes, disassemble it, or instrument it. We assume the sandbox enforces this against all byte access, including runtime memory extraction and tracing; if it does not, the binary can be dumped and the benchmark’s own anti-reverse-engineering guarantee fails, which only sharpens the same conclusion. The solver also has no internet: the paper enforces this by running the container without a network, and its system prompt tells the model that “the only source of truth about what the executable does is the executable itself and its bundled documentation” and that it “must not search the internet, package registries, or any external source” (§2.1, §A.2). The same prompt adds a revealing instruction: “even if you recognize what the executable is, you must reimplement it from behavioral observation alone” (§A.2). The clause presumes recognition is common and tries to manage it, which concedes that prior possession of the program is a live channel the benchmark must suppress by instruction. The solver’s inputs are therefore exactly two, the queryable binary and the bundled documentation; anything else, a public format specification, an RFC, a published algorithm, is available only from training, which is recall rather than reconstruction from the artifact. The agent emits a program P and a build script. P is graded by a hidden suite H of behavioral tests, each a pair (x, B(x)) asserting that P(x) matches the original on input x. The primary metric, Fully Resolved, credits a task only when P matches B on every test in H. Suites run from hundreds to tens of thousands of tests per task, 248,853 in total with a median near 770 (§2.3).

Two properties of H carry the argument, both stated in the benchmark’s own construction. First, H is produced by a generator with source access: it explores the program, its source, its tests, and its documentation, and uses line coverage of S as its objective, adding tests to reach code paths not yet exercised. Second, grading is conjunctive: a single failed test forfeits the task. The solver must reproduce a behavior whose graded points were chosen by reading a source it cannot see, and reproduce all of them at once.

3. What the benchmark sets out to do

The aim is specific and worth stating in its own terms. Existing benchmarks, the paper argues, measure “focused, limited tasks such as fixing a single bug or developing a single, specified feature”; ProgramBench instead asks an agent to architect and implement a codebase that matches the reference executable’s behavior, offered as “a test of software design, not just implementation alone” (§2.4). The construction is careful where its predecessors were not. Grading is determinate: the reference is a complete, queryable function with one ground-truth answer per behavior, so there is no maintainer’s arbitrary choice for a faithful solver to recover and fail on. Grading is implementation-agnostic, checking behavioral equivalence and admitting any language or architecture (§2.1). The suites are filtered: an assertion linter cuts the dummy-pass rate from 18.5% to 3.7% (§5.1), tests the reference itself fails are discarded, and generated coverage averages 79.7% against the projects’ own 56.8% (§5.1, Figure 7). The reporting is candid: a finite suite “necessarily under-approximates” the specification (§2.2), and ”% Tests Passed also does not reliably correlate with percentage of working functionality” (§3). These are real improvements, and they address the failures that retired SWE-bench Verified.

We take all of it as given. The argument that follows is narrow and orthogonal: determinacy is not the property that decides solvability. A specification that is complete, and queryable anywhere, is unusable if the solver cannot know where to query it. The rest of this paper is that gap, between a behavior being determined and a behavior being findable.

4. Terms

We fix four terms used throughout. They sort a graded behavior by how a source-blind solver could obtain the value the hidden test expects.

Discoverable. The expected value is obtainable from the materials at hand: the bundled docs, or running the reference and inspecting its outputs and files (generate-and-inspect), or probing an enumerable, numeric, or pointer-valued input. One observation generalizes, because the value is a constant or a rule the solver reads off once and reproduces. zoxide’s database version byte is discoverable: run the reference, read the file it writes, see the 3.

Brute-searchable. The expected value lies in a finite space the solver could in principle enumerate, but the space is too large to traverse within the budget. The answer is reachable by search and only by search, so feasibility turns on the size of the space. An undocumented magic constant or a checksum gate on an input the reference will not emit is of this kind.

Recall-only. The expected value is the output of a function, a hash, a cipher, a codec, a compressor, that a solver cannot reproduce from feasible observation of the reference, because the outputs on chosen inputs carry no information that fixes the output on an unseen one. The function is perfectly determined, by its published specification; the point is not that the value is unknowable but that the only feasible channel to it is prior possession of that specification, recalled from training rather than recovered from the artifact. That is recall, not reconstruction. It is distinct from brute-searchable, because no larger search budget closes the gap, and it is the paper’s spine.

Benchable and unbenchable. A program is benchable when every graded behavior in its hidden suite is discoverable. It is unbenchable when at least one graded behavior is recall-only or brute-searchable beyond budget. Because the primary metric is conjunctive, one such test makes Fully Resolved unreachable by reconstruction, so the verdict attaches to the whole program, which is the unit a benchmark runner can cite and exclude.

5. Why running the reference is not enough

Because B is execute-only, the solver’s only operation on it is to run it: submit an input, observe the output. Disassembly, symbolic execution, and instrumentation, the tools that would recover a constant or a branch condition, all require reading the binary and are denied by the permission the benchmark sets to block ghidra. The right model of the solver is an algorithm with black-box oracle access, which is the benchmark’s actual setting, not an idealization.

What the oracle yields is finite. In its budget a solver runs B on finitely many inputs and observes finitely many pairs (x, B(x)). A program’s behavior has a head, the documented contract that D and convention specify and competent synthesis reproduces, and a tail, the accumulated handling of particular cases. For the head, and for any behavior that is a constant or a simple rule, a finite sample suffices: one run reveals it and it generalizes, the discoverable case of the previous section. The hidden suite, concentrated by its source-guided selection on the tail, is where finite sampling fails, and it fails in two ways.

The first is a gradient-free gate: a branch entered only when some bits of the input hit an exact value, an internal checksum or a sentinel, where every near miss returns the same response. Nothing in the transcript, output, error, exit, length, timing, points toward the value, so an undirected search over a d-bit discriminator costs on the order of 2^d queries and no black-box method does better save by chance. For deep gates this is already past any feasible budget. This is the brute-searchable case; we note it and pass on, because the second is stronger.

The second is recall-only: the graded behavior is the output of a function the solver cannot reproduce from feasible observation, a hash, a cipher, a codec, a compressor. Running b3sum on a thousand chosen inputs teaches nothing about its value on the thousand-and-first, because the digests of chosen inputs carry no information that fixes the digest of an unseen one. The function is perfectly determined, by the BLAKE3 specification; the point is not that the value is unknowable, it is that the only feasible channel to it is prior possession of that specification, which is not in the binary’s output, not in the usage docs, and, with no internet, not fetchable. A solver that produces the right output on an unseen input has recalled the algorithm, not recovered it from the artifact.

The paper’s own analogy concedes the gap. The executable, it writes, “resembles how a developer queries a product manager” (§2.4); but a product manager answers the question the developer did not know to ask, while the oracle answers only the question posed. It returns f(x) for a fully formed x; it does not reveal which inputs carry an unsignposted degree of freedom, nor supply a function no finite sample identifies. Queryability reduces output uncertainty and leaves untouched the two that decide the hard tasks: which behaviors exist to be matched, and, for the recall-only ones, a value the oracle cannot teach.

6. The selection procedure concentrates difficulty

The asymmetry follows from the selection objective. The generator maximizes line coverage of S, which steers test generation toward paths a generic input rarely exercises, the low-frequency tail where the recall-only behaviors sit. The procedure does not target them by name; its reward correlates with reaching them. The benchmark grades a source-blind solver against a source-guided selector whose objective tracks the behaviors the solver cannot reach from the artifact. And it compounds with thoroughness: each test the generator adds is another chance to land on such a behavior, so the more thorough the grader, the larger the share of its tasks that carry one.

This bears on the coverage figure the benchmark reports. Its generated suites reach line coverage near eighty percent, comparable to the projects’ own (§5.1, Figure 7). That figure was obtained with source access, by the generator. The figure relevant to solvability is the coverage a source-blind solver can reach from the binary and documentation within budget, and it is not reported. The reported number measures the party holding the map.

7. The fairness audit checks a different property

The paper audits all 200 tasks and reports that no test invokes a flag or subcommand absent from the documentation, with five instances where implementation-dependent output could plausibly appear, none realized (§2.2, detailed in §A.3.4). Its premise is that “any behavior a behavioral test expects is discoverable by running that same command” (§2.2). We take the audit at face value. It establishes that the entry points are documented; it does not establish that the graded behavior at those entry points is determined by anything the solver receives. A named flag shows a door exists. It does not fix which room behind it is graded, and the leak is one level below the flag: a behavior at a documented entry point, triggered by a value or combination recorded in the source and absent from the documentation.

FFmpeg makes the gap concrete on two axes. By value: the full manual documents that -ss exists and even that its position relative to -i selects input versus output seeking, so we do not claim that semantics is unwritten; we claim the narrower thing the audit cannot, that a flag-presence check fixes neither which positional reading a test grades nor, for the interactions the manual leaves to the source, that they are written at all. By volume: ffmpeg -h prints a curated subset of options, the full listing needs -h full, and per-codec options appear only under targeted queries such as -h encoder=libx264. A complete index is not a determined behavior.

The paper raises feasibility directly and answers it, and the answer checks the same kind of property a second time. Its feasibility review (§A.2.4) asks whether a graded behavior is discoverable: whether it is “not communicated or documented via any observable channel,” whether it is “entirely absent from the task instance’s start state.” It reviews all 200 repositories, finds none, and concedes only the case it can already rule out, an executable with “important but entirely undiscoverable functionality.” We grant the finding and deny that it settles feasibility, because discoverability is not reconstructability. A hash is discoverable: run b3sum, observe a digest, the behavior is plainly present in an observable channel. It is not reconstructable: the digest of one input fixes nothing about the digest of another, so observing the behavior does not let a solver reproduce it on the hidden test input. The audit rules out behavior with no observable instance; it does not test for behavior that is observable yet not reproducible from feasible observation, and the recall-only class lives exactly there. “This does not mean the task is impossible,” the section writes (§A.2.4), and we agree: the task is not impossible, it is recall-gated, which a discoverability review was not built to detect.

8. The grader does not implement the analogy

The product-manager analogy describes a process: query, hypothesize, infer behavior “in the absence of complete specifications” (§2.4). The primary metric scores a result: Fully Resolved credits a task only when the candidate matches the reference on every hidden test (§2.1). These answer different questions. The analogy justifies the difficulty of the search; it says nothing about which results are admissible, and the grader admits exactly one, byte-identity with the reference on the tested inputs.

Inference under partial specification, taken seriously, would credit a solution correct in contract though not identical in detail, which is what implementation-agnostic grading is said to buy. Byte-exact equality on a hash output is not contract-level correctness, it is identity, and for a recall-only function identity is reachable only by holding the function in advance. The conjunctive byte-check cannot tell “inferred the right behavior” from “reproduced the exact reference,” because for these behaviors the two are the same bytes or they are failure. The metric collapses the distinction the analogy depends on.

The paper’s own ranking shows which side it stands on. It makes Fully Resolved primary and downgrades percent of tests passed to a relative measure, noting that “even a single failed test can imply a fundamental flaw” (§3). A benchmark built to reward partial inference would lead with the partial metric; this one leads with the least partial-credit object available. The likeliest explanation is not confusion but expedience: a golden-output check is the grader one can build and run at scale, and a contract-level grader that honored the analogy is the one reached for when the deadline has passed. The fork holds either way. If the measured skill is inference under incomplete specification, Fully Resolved is the wrong primary metric; if Fully Resolved is right, the product-manager framing is the wrong description. The paper cannot keep both.

9. The only feasible channel

No solver restricted to running an execute-only binary, reading the bundled docs, and working offline reconstructs a recall-only behavior: a finite sample does not identify the function, and the specification is withheld and unfetchable. What remains is prior information, and it divides in two. For a behavior fixed by a public standard, a hash, a codec, a published format, general knowledge of that standard supplies the value, but with no internet that knowledge is recalled from training, not consulted. For a behavior recorded only in the program’s own source, program-specific memory is required, and no public standard supplies it either. The benchmark does not distinguish the two, and the argument does not need it to: both are carried from training rather than recovered from the artifact, and both convert the task from reconstruction to lookup.

The consequence is direct. For a recall-only behavior, prior information is the only feasible channel; a solver that matches it has recalled it, not derived it. Full resolution demands every graded behavior at once, so a task with one such behavior sits at its floor unless training supplies it, and the partial metric measures, in part, how much of the tail a model has memorized, confounded with documentation quality, scope, and program size, which themselves move with prevalence. Prior information is not a confound to be scrubbed away here; for the recall-only behaviors it is the channel the metric reads.

10. The published results are consistent with the argument

The witness receipts in the appendix carry the argument: a referee who grants one row has granted a recall-gated task, and the case does not rest on what follows. We add three published observations that are consistent with the reading, as corroboration and not proof, each individually confounded. The benchmark reports a floor of zero Fully Resolved across all nine models (§4, Table 2), which a large enough task size would also produce. It reports widely cloned utilities scoring highest and large idiosyncratic systems lowest, difficulty largely model-agnostic (§4, Figure 5), though prevalence is confounded with documentation quality and scope. The sharpest is the internet ablation. With the network open, source-code lookup was the dominant behavior, “accounting for 79–95% of flagged runs,” with twenty to thirty-six percent of tasks flagged (§4.1, Table 3). The paper reads this as cheating to be blocked; read as solvability it is a measurement of how often a model needs the specification, and cutting the internet does not remove the need, it removes the legitimate channel and leaves recall. None of the three separates the reading from “these are merely large, rare programs.” The receipts do, and they stand alone.

11. The paper’s claims, examined

We have argued the general case; we now hold it against the paper’s specific defenses, each quoted and located. Five claims carry its fairness and its reading of the results, and none survives the no-internet, source-blind condition the paper itself imposes.

11.1 Discoverable by running the command

The fairness argument rests on the oracle: the executable is “a comprehensive but opaque oracle” whose “expected behavior is fully encoded, but must be queried to be understood,” so “any behavior a behavioral test expects is discoverable by running that same command” (§2.4, §2.2). Resolving an output for a chosen input is free, and it is not the constraint. It does not supply an un-inferable algorithm, a codec or a hash, whose output on the unseen inputs the hidden tests use is not learnable from any feasible set of input-output pairs; and with no internet the specification cannot be fetched (§2.1). “Discoverable by running the command” holds for documented flags and fails for the value and algorithm space, which is where the grading concentrates.

11.2 A standard of feasibility the audit cannot enforce

The paper sets the right standard and then cannot check it. In §A.3.4 it defines an overspecified test as one that is “undesirable because they make a ProgramBench task instance infeasible to successfully complete,” and prunes such tests. Yet the only feasibility check it runs, in that same section, is that no test invokes a flag or subcommand absent from the documentation (§A.3.4). A test that grades H.265 encoding uses the documented codec flag; it passes the check and is infeasible all the same, because the documented flag does not carry the codec. The standard condemns these tasks; the audit is blind to them.

11.3 Recall, controlled by the language constraint

The paper presents the different-language ablation as forcing “deep understanding of program behavior rather than surface-level recall” (§4.1). Switching the implementation language defeats verbatim reproduction of the reference’s source; it does not defeat recall of the program’s behavior, which a model re-emits in any language. The ablation’s own numbers are mixed, some models falling and others rising under the constraint (§4.1). It does not close the recall channel the metric reads.

11.4 The asset asymmetry, mitigated

The paper bundles “test assets that a model cannot reasonably synthesize on its own” and concedes that pairing them with no internet is an “unfair asymmetry” (§2.2, §A.2). Bundling the asset settles input availability and nothing else. Given the H.265 file, matching the reference’s decode still requires the codec; the mitigation supplies the input and withholds the algorithm, which is the half that was never reconstructable.

11.5 Zero, read as a frontier

The paper reads its floor as difficulty: “no model fully solves any single ProgramBench task instance,” with “meaningful progress” besides (§4). A floor of zero across nine models, on a set whose recall-only tasks no source-blind solver can reconstruct, is the signature of a recall gate, not a capability frontier: it stays at zero for any solver that does not already hold the missing specifications, because what is absent is information the artifact does not carry, not skill the model has yet to acquire. Reading it as a frontier credits capability for what is recall shortfall.

12. One witness is enough

Because Fully Resolved is conjunctive, the task-level claim reduces to a single behavior: if a task’s suite contains even one recall-only test, no source-blind solver resolves that task, and the verdict attaches to the whole program. We exhibit such tests directly, because the benchmark’s generated suites are public and we read the assertions.

The cleanest witness is a graded test that compares the program’s output, byte for byte, against a bundled golden asset only the reference algorithm produces. blake3’s suite is the model. Its test_chunk_boundary_1024_bytes_exact runs b3sum on a bundled 1024-byte file and asserts that standard output equals a bundled hash_1024.golden, the reference’s exact digest. The input is not handed to the solver at inference, the digest of an unseen 1024-byte file is not interpolable from any finite set of other digests, and the BLAKE3 specification is neither in the usage docs nor fetchable offline. The test is recall-only, and one such test forecloses Fully Resolved by reconstruction.

The reading is confirmed against bodies, not inferred, and it cuts both ways. jp2a, which renders a JPEG to ASCII, asserts its output equals a bundled fixture produced by decoding the image through libjpeg, whose inverse-DCT rounding no standard-library decoder reproduces: recall-only. The contrast is as sharp the other way. zoxide’s apparent “exact-format” tests, which grade a database version byte, a null-byte separator, and an error template, are all discoverable: the solver runs the reference, reads the file it writes and the message it prints, and reproduces them. zoxide is benchable; blake3 and jp2a are not. The appendix carries the program-level verdict with one witness test apiece, read from the suites themselves.

13. Objections

The suites reach eighty percent coverage, so search does find the branches. With source access, by the generator. The solver’s reachable coverage, source-blind and within budget, is the relevant quantity and is unreported.

The binary is given, so reverse-engineer it. It is execute-only. Reading, disassembling, and instrumenting it are denied by the permission that blocks ghidra. The solver’s only operation on the binary is to run it, which is the model the argument assumes.

Recall is a real capability, so measuring it is legitimate. Agreed. A benchmark of how faithfully a model reproduces known software from prior exposure is a coherent instrument. It is a different instrument from the one described, which states that it tests building software from scratch and architectural design.

The solver need match only the finite suite H, not all of f. The solver does not know H, and generalization cannot cover a recall-only witness: the output of a hash or codec on the hidden test input is not interpolated from its outputs on neighboring inputs, because the neighbors carry no information about it. One such test in H is enough, and the solver cannot know which it is.

The binary is queryable, so probe it to resolve any ambiguity. This holds for the option space and not the value space, and the distinction is the argument. Flags and subcommands are documented and enumerable: a solver reads the help, tries the handful of options, even runs an order-sensitive flag both ways, and observes the reference’s choice, so ambiguity in what a documented flag does is self-resolving by probing. Untyped inputs are not enumerable. A string, a byte stream, a number at its edges has no finite set of cases to walk, and the values that trigger a distinct behavior are sparse and unsignposted in an unbounded space. The oracle answers any value submitted, but the solver cannot enumerate which values matter, and the source that marks them is withheld. Probing settles the flags; it does not settle the values, and the value space is where the source-only witnesses of the appendix live.

The benchmark bundles the inputs, so an H.265 video or an obscure binary asset is provided, not synthesized. True, and the paper does this on purpose, conceding that without it “evaluation uses assets that … models are unable to generate on their own” and that this is an “unfair asymmetry” (§2.2, §A.2). But a bundled input does not make the behavior reconstructable. Given the H.265 file, matching the reference’s decode requires the H.265 codec; given a corpus, matching its output requires the compressor. The asset is the input; the behavior on it is fixed by an algorithm the supplied docs do not contain, that input-output pairs do not reveal, and that no internet can fetch. Bundling settles input availability and leaves the recall barrier untouched.

14. What the metric measures

The benchmark is not careless. Its test hygiene is substantial and its grading is sound on the points it checks. The defect is upstream, in the framing. The absence of a specification is presented as a virtue, implementation-agnostic and memorization-resistant; it is also the defect, because an artifact carries no ranking of which behaviors matter, and a grader that imports that ranking from the withheld source grades against information the solver was denied. What the primary metric reports is ordinary synthesis on the documented head, where strong models already perform and where the metric does not concentrate, and prior information, recalled algorithm specifications and program-specific memory, on the recall-only tail, which is what produces the floor and the prevalence-graded partial signal.

Where exactly the admissible set ends is an open problem, and we do not pretend to draw its boundary. The benchmark needs determinism, since an exact behavioral check presumes the reference is a function, so it rightly excludes or pins the non-deterministic cases; but determinism is necessary, not sufficient. A hash is deterministic and recall-only. The admissible set, the programs every one of whose graded behaviors a source-blind solver could reconstruct from the artifact within budget, lies inside the deterministic ones, and its boundary is not a property of the program alone: it moves with the supplied documentation, the hidden suite’s coverage, and the comparator’s strictness, the same three-way dependence that makes a single benchable-or-not verdict ill-posed in the general case. We claim only the sound one-sided direction. A recall-only witness is sufficient to place a program outside the set, which is all an exclusion needs; the complete characterization of what belongs inside we leave open.

The defect is upstream of grading, and it is repairable without giving up behavioral equivalence, the part that works.

15. The second axis: the oracle’s provenance

The recall argument asks whether a source-blind solver could reconstruct a graded behavior. A separate question, independent of it, asks whether the oracle is a valid contract at all, and the generated test code answers it in the open. Many graded tests build their own golden from the reference run:

if not golden.exists():
    golden.write_text(result.stdout)
assert result.stdout == golden.read_text()

The expected value is captured from the reference, then asserted against it. At least twenty-nine programs ship graded tests of this shape, more than carry a recall witness, seven of them the same programs, where the captured golden is the very digest or decompressed blob the solver was never going to reproduce.

A capture is not automatically wrong; the bytes it records may be the intended behavior. What it is not is a contract. A golden earns its authority as an oracle only when something independent of the reference has confirmed that the captured bytes are the behavior that matters and not an incidental artifact of one build, the object order a library happened to emit, a timestamp, a path. Nothing in the pipeline performs that confirmation, so the provenance is the reference grading itself: it cannot fail a test whose answer key is its own output, while a candidate that meets the same contract through different bytes does. The conditional form is worse, and twelve of the twenty-nine take it. Where the golden is written only if not golden.exists(), a suite shipped without that file does not error, it records whatever the candidate emitted and passes, an oracle that certifies any output at all. A check that cannot fail measures nothing, the gauge glued to pass, and what cannot be false cannot be true.

We do not claim no human was in the loop, which the artifacts cannot establish. We claim the weaker and sufficient thing: the suites show no sign of an effective per-test validation pass, because the defect such a pass exists to catch, a graded test that writes its own answer key, survives across the set. The contrast is SWE-bench Verified, where the word names that pass, a human confirming the gold patch resolves the task and that the tests admit it. Here the goldens were generated, committed, and trusted unread.

This does not condemn the whole suite. Roundtrip checks, observable constants, and return codes stay valid, untouched by any captured golden. The damage is to the headline. Because Fully Resolved is conjunctive, one self-captured golden, like one recall-only test, zeroes an otherwise sound task, and the published figure blends four oracles it never separates: the valid but unreconstructable, the implementation-pinned, the weak-provenance capture, and the genuinely contractual. The score does not say which kind a task failed on.

16. Recommendations

The defect is repairable without giving up behavioral equivalence, and the repair differs by who is using the benchmark. We separate two audiences, because they have different powers and different incentives.

For a benchmark runner reporting model scores, the actionable unit is the whole program, not the individual test. A runner cannot drop tests to taste: choosing which to exclude moves the very score being reported, a conflict of interest. What a runner can do is exclude whole programs that no source-blind solver can resolve, on a citable authority. The appendix gives that list, the programs whose hidden suite contains at least one recall-only test, read from the test bodies. Excluding them, a runner reports Fully Resolved against the benchable subset rather than the full 201, and the headline figure stops conflating reconstruction with recall. One recall-only test is enough to move a program onto the list, because the metric is conjunctive.

For a benchmark maker the unit is the test, and three changes make a program benchable. First, admit a graded behavior only when the supplied documentation determines it: a competent solver should reach it from the help and manual alone, without prior exposure to the program. This is the fairness test the paper states informally, that interacting with the binary “resembles how a developer queries a product manager” (§2.4), made into an admission rule. Second, prefer assertions a source-blind solver can satisfy. A roundtrip or self-consistency check, compress then decompress, write then read, grades the contract without pinning the reference’s exact bytes; xz and pigz pass exactly because their suites do this, decompressed through the standard library’s lzma and zlib. A check against an observable constant grades a format the solver can read off the reference. The pattern to avoid is the byte-exact comparison against a golden asset only the reference produces, whether that asset is a hash digest, a rendered image, or a panic message carrying the reference’s own source path. Third, where a program’s core is a published algorithm, a hash, a cipher, a codec, a compressor, score it on a separate track: holding BLAKE3’s compression function in advance is a real capability, but it is knowledge, not reconstruction, and no usage page will ever contain it. Fourth, never let a test author its own oracle: a golden captured from the reference run certifies provenance, not correctness, and one written only when the file is absent will pass any output at all. Validate the goldens the way SWE-bench Verified validates its tasks, a pass that confirms each expected value is a contract a second implementation could meet, not a byte the reference happened to emit.

The triage is automatable, which matters at 248,853 tests: the maker cannot read them by hand and will reach for the same LLM assistance the suites were generated with. The reading we applied by agent over the public suites runs as one pass per test on a rule sharp enough to hand an agent directly. For each graded assertion, classify how a source-blind, offline solver would obtain the value it checks.

The one question that separates Recall from Observable is whether the expected value would change on an unseen input in a way the reference’s observable outputs do not reveal. A hash answers yes and is recall; a version byte answers no and is observable. The rule is mechanical enough to run at the suite’s scale and specific enough that two agents applying it should agree, which is the property the generation pipeline already relies on.

We do not run that audit here; one recall-only witness per program is the existence proof, the appendix carries one apiece, and the full triage is the maker’s task. The repair keeps determinacy, implementation-agnostic grading, and the test hygiene the paper already has, and gains the construct validity it claims.

Appendix A. Witness receipts, read from the test bodies

The benchmark’s generated suites are public (programbench/ProgramBench-Tests), so a task’s verdict need not be inferred; we read its assertions. For each program we look for one recall-only test, a graded assertion no source-blind, offline solver can satisfy, and stop at the first, because the conjunctive metric lets one such test sink the task. The unit a runner excludes is the program; the receipt is the assertion.

Reading the bodies corrects the record. An earlier version of this audit nominated source-only format witnesses, a database version byte, a file-header magic, on the theory that a value written only in the source is unreachable. The assertions refute it. zoxide’s test_database_binary_format_version_3 reads the database the reference writes and checks its first four bytes; a solver runs the reference, reads those bytes, and reproduces them, so the value is source-only and discoverable at once. zoxide is benchable, and the format-witness screen is withdrawn. What survives reading the assertions is narrower and harder, and it comes in two kinds.

The first is an un-inferable function over an open domain: the assertion compares output, byte for byte, against a bundled golden the reference algorithm produced, on an input the solver cannot pre-query, where the function is a hash, a cipher, a codec, or a compressor. No finite sample identifies it and the specification is offline. The repair is to ship the specification or to score the program on a separate knowledge track. The second is an implementation detail pinned in a golden: a value that is the reference’s own, not the contract’s, such as a panic’s source location or an exact terminal snapshot, which no independent implementation reproduces. That is a test-quality defect, repairable by normalizing the golden or asserting the contract instead of the trace.

The contrast is the discipline. A roundtrip test grades a compressor without pinning its bytes: xz and pigz compress with the candidate and decompress with the standard library’s lzma and zlib, so any valid implementation passes, and both are benchable. The difference from blake3 is not the domain; it is whether the assertion demands an output only the reference can produce. The line is exact: the standard library carries zlib, gzip, bz2 and lzma, so gzip, bzip2 and xz are roundtrippable, while zstd, lz4 and brotli are absent from it, so a golden that pins their output is recall. lz4 shows why the read must be complete: its suite roundtrips in some tests but also pins decompressor output against a bundled golden in test_cli_golden, and the conjunctive metric gates on the latter, so it belongs with the recall set below, not here.

Table A1. Verdict summary. The audit is a one-sided search. It confirms a program unbenchable by exhibiting one recall witness, but it cannot certify benchability, which would require exhausting a suite of up to 14,645 tests against a strict comparator. The count below is therefore a floor, and the residual is “no witness found,” not a clean bill. The unit is the program: one recall witness forecloses Fully Resolved.

VerdictProgramsBasis
Unbenchable, recall witness verified20a graded test demands an un-inferable function (hash, cipher, codec, compressor, opaque binary format) compared to a bundled golden; a positive existence proof
Unbenchable, golden pins an implementation detail3a byte-exact render (ditaa PNG, typst PDF) or per-frame hash (ffmpeg) the contract does not fix; a test-quality defect rather than recall
No witness found177no witness surfaced in a complete read of the exact-match assertions; not a certificate of benchability
Unread1test bodies absent from the released set (a placeholder fixture task)
Total201

The twenty are a floor, not a count of the unfair set, but they survive a complete pass rather than a sample. A deterministic regex splits every assertion in the published suites into exact-output comparisons, the recall-eligible set of 135,740 lines, and the benchable rest (substring, return code, length, membership); each recall verdict is then re-derivable from the cited assertion. The search stays one-sided, so “no witness found” remains the absence of a surfaced witness and not a clearance. What the complete read buys is robustness against the lossy first pass that filed lz4 as benchable on its roundtrip tests while missing the golden that gates it. A contestable tier sits just outside the floor.

Table A2. Recall witnesses. The spine. Each row names one graded test whose expected value is the output of a function no source-blind, offline solver can reproduce; shipping the specification, or a separate knowledge track, is the only repair.

ProgramWitness testGraded behavior, and why recall
blake3test_chunk_boundary_1024_bytes_exactb3sum of a bundled file vs a .golden digest; BLAKE3 is not inferable from input-output pairs and its spec is offline
php-srctest_phpt_file[gost]a GOST hash output vs the .phpt expected vector
7ziptest_scrc_hash_functions_xxh64an XXH64 digest vs a golden; xxHash is not in any standard library
agetest_decrypt_existingdecrypts a bundled .age file; requires X25519, ChaCha20-Poly1305 and scrypt, none in stdlib
brotlitest_binary_data_decompressiondecompresses a bundled .compressed to exact bytes; brotli is not in stdlib
zstdtest_golden_decompressiondecompresses a bundled .zst to exact bytes; zstd is not in stdlib
lz4test_cli_goldendecompressor output vs a bundled golden; lz4 is absent from the standard library, where gzip and xz are present
soxtest_lpc10_statistical_propertiesstatistics of an LPC10-decoded asset vs a golden; the LPC10 speech codec is not in stdlib
jp2atest_ext_width_default_paletteASCII render vs a libjpeg-decoded fixture; the decoder’s rounding is in no standard library
pixtermtest_webp_format_supportrender of a WebP image vs a golden; WebP decode is not in stdlib
chafatest_loadersANSI render of a decoded image vs a golden; the implementation language (C) carries no standard image decoder
fasttexttest_print_sentence_vectors_basicprinted sentence vectors vs a golden; the fastText embedding algorithm
gdaltest_raster_info_checksumthe image checksum (e.g. 4672) from GDALChecksumImage, which is not in stdlib
samtoolstest_depth_basic_output_formatdepth over a bundled BAM vs a golden; the BAM binary record format is not in stdlib
bedtools2test_bamtobed_t1_one_block_no_splitbamtobed of a bundled BAM vs a golden BED; the same BAM format, decoded
dsqtest_parquet_type_preservationan exact column value from a bundled .parquet (and a .avro union in its sibling test); neither format is in stdlib
duckdbtest_import_parquetquery output over a bundled .parquet vs a golden; Parquet decode is not in stdlib
parqeyetest_tui_data_displaya terminal render of a bundled .parquet; the view exists only by decoding it
elfcattest_elf32HTML of an ELF binary’s structure vs a .golden; the ELF binary layout is not in stdlib
ovtest_zstd_decompressiona screen render that requires zstd-decompressing a bundled fixture; zstd is not in stdlib

Every row is re-derivable from the task’s tarball in programbench/ProgramBench-Tests: the witness test names the assertion and the bundled golden it cites. The list grows only by addition, since each row is an independent existence proof; the full per-program verdict is published with the companion data.

Adjacent but not recall: the implementation-detail tier. Three programs fail reconstruction for the second reason above, and Table A1 counts them apart. ditaa and typst compare a rendered artifact, a PNG and a PDF, byte for byte against a golden, and ffmpeg pins per-frame MD5s of a rendered clip. Matching any of them demands the reference’s exact rasterizer, font embedding or frame pipeline, bytes that are the reference’s own and not the contract’s, which no independent implementation reproduces. Unlike a recall witness, the repair is local: normalize the golden or assert the contract. Counting them as recall would overstate the floor, so they are tabled separately; the spine stays the un-inferable functions.

Excluded from the floor: the contestable tier. Some programs grade an exact output that is structured yet derivable by a competent solver from examples plus general engineering knowledge, so we do not count them as recall: coordinate-projection math (proj), and machine-code or bytecode generation (tinycc, quickjs, luajit, and tree-sitter’s compiled WebAssembly). Others turn on a non-standard data table whose pull we could not confirm mechanically without fetching each golden, chiefly the Unicode display-width and grapheme tables that terminal-layout tools in Rust and Go (csview, tuc, fzf, fx among them) need to align wide characters; these are likely recall by the same argument that condemns a font, since no finite sample recovers a width table, but we hold them contestable. pandoc reads .docx, a ZIP of XML the standard library can open, derivable rather than recalled; ascii-image-converter decodes PNG, which is in the Go standard library, where its sibling chafa is C and has no such decoder and so earns a witness above. A referee could contest any of these either way; we leave them out rather than inflate the floor. The twenty are what survives that conservatism.

The burden this sets is low, and that is the point. We need not show that a whole suite is unreachable, only that one graded test in it is, because Fully Resolved credits a task only at one hundred percent and the existence proof is the unit. The same arithmetic is the uncomfortable reading of grader thoroughness: each test the generator adds is another chance to assert against a golden only the reference produces, so the more thorough the suite, the more of its tasks pass out of reach of reconstruction and into reach only of recall. Thoroughness and unfairness compound.