Abduction Tuner: a null result

Part of the methodology series. Sequel to Investigate.

I tried to transplant the abduction engine, hypothesis graph and all, from speedygrad to PyTorch Inductor. Spent a day thrashing a 4080. No end-to-end improvement above measurement noise. The headline is the negative finding: for off-the-shelf NVIDIA hardware running off-the-shelf models, don’t autotune; use Inductor’s default mode. Default dispatches matmul through cuBLAS and conv through cuDNN, and no Triton-codegen autotune beats the vendor-tuned kernels on canonical shapes.

The full evidence is in the PyTorch fork’s HYPOTHESIS_GRAPH.md; the summary is below.

Generate and test

Both tinygrad and PyTorch’s Inductor compile ML operations into GPU kernels through the same mechanism: heuristics produce candidate configurations, a benchmarking harness times each one, the fastest wins.

In tinygrad, the linearizer defines a combinatorial space of axis splits, upcasts, and local sizes; BEAM search explores it at 200 candidates per kernel on real hardware. In Inductor, triton_heuristics.py generates 15-20 configs, the harness benchmarks all and picks the min, then coordinate_descent_tuning fine-tunes. Same interface; one searches a large space, the other a small one with hand-tuned starts.
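Both loops fit the same sketch. A minimal version, where `candidate_configs` and `time_kernel` are illustrative stand-ins rather than actual tinygrad or Inductor APIs:

```python
import math
from typing import Callable, Iterable

def generate_and_test(candidate_configs: Iterable[dict],
                      time_kernel: Callable[[dict], float],
                      repeats: int = 5) -> dict:
    """Generate-and-test: benchmark every candidate on real hardware, keep the fastest.

    BEAM search and Inductor's heuristics differ only in how candidate_configs
    is produced (a large searched space vs. 15-20 hand-tuned starting points).
    """
    best_cfg, best_t = None, math.inf
    for cfg in candidate_configs:
        # median of a few repeats; the noise floor this leaves becomes important later
        samples = sorted(time_kernel(cfg) for _ in range(repeats))
        t = samples[len(samples) // 2]
        if t < best_t:
            best_cfg, best_t = cfg, t
    return best_cfg
```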

Concept map: tinygrad components mapped to their PyTorch/Inductor equivalents. BEAM search maps to Triton autotuner. heuristic.py maps to triton_heuristics.py. Both pairs highlighted as the seam where the abduction engine connects.

In speedygrad, the abduction engine works: 52 trials vs BEAM’s 193 actions, 1.85x geometric mean speedup over the heuristic on 4/5 workloads. That motivated the transplant.

What I built

Three iterations of the engine, ported as a fifth Inductor autotune mode alongside default / max_autotune / coord_descent.

v1 — collect-all + static transitions. Coordinate descent with a hand-authored transition graph: BLOCK_M → {BLOCK_K, num_warps, BLOCK_N, num_stages} etc. If BLOCK_M improved, follow its edges instead of round-robin. The graph encodes hardware structure (tile size determines occupancy determines warp count) and stays small (10 parameters, ~30 edges).
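A sketch of the v1 loop; `TRANSITIONS` is a small fragment of the hand-authored graph and `bench` stands in for the real harness:

```python
# A fragment of the ~30-edge transition graph. Edges encode hardware structure:
# if changing this field helped, its neighbors are the next most likely levers.
TRANSITIONS = {
    "BLOCK_M": ["BLOCK_K", "num_warps", "BLOCK_N", "num_stages"],
    "BLOCK_N": ["BLOCK_K", "num_warps"],
    "num_warps": ["num_stages"],
}

def neighbors(cfg, field):
    """Single-field moves: halve or double one value, everything else fixed."""
    out = []
    for scale in (0.5, 2.0):
        new = dict(cfg)
        new[field] = max(1, int(cfg[field] * scale))
        out.append(new)
    return out

def tune_v1(cfg, bench, rounds=8):
    best_t = bench(cfg)
    frontier = list(cfg)                      # first round: round-robin over all fields
    for _ in range(rounds):
        # collect-all: score every single-field move before accepting anything
        moves = [(bench(c), field, c)
                 for field in frontier
                 for c in neighbors(cfg, field)]
        t, winner, c = min(moves, key=lambda m: m[0])
        if t >= best_t:
            break
        best_t, cfg = t, c
        # static transitions: next round follows the winning field's edges
        frontier = TRANSITIONS.get(winner, list(cfg))
    return cfg
```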

v2 — accept-immediately + same static transitions. Take the first improvement instead of evaluating all single-parameter moves. Cheaper per round, same graph.
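v2 only changes the inner step of that sketch: return the first single-field move that beats the incumbent instead of scoring the whole frontier first.

```python
def next_move_v2(cfg, frontier, bench, best_t):
    """Accept-immediately: the first improving move wins; same transition graph."""
    for field in frontier:
        for c in neighbors(cfg, field):
            t = bench(c)
            if t < best_t:
                return t, field, c
    return None          # no improving move found; caller stops or widens the frontier
```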

v3 — composed evidence + per-kernel hypothesis state. What the source theory actually requires. Per-kernel typed hypotheses (X_BOUND, R_BOUND, M_BOUND, WARP_BOUND, NEAR_OPT, etc.). Each measurement is a (before, after, field-changed) diff. Likelihoods update multiplicatively; the next experiment is whichever perturbation maximally discriminates between top live hypotheses, not a static transitions[winner] lookup. Acceptance requires composed signal above the detection cliff (≥5% effect after multiple samples), not first improvement above noise.
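A compressed sketch of the v3 bookkeeping. The hypothesis names come from the engine; the `EXPECTED_SIGN` table and the ×2 / ×0.5 likelihood factors are illustrative stand-ins, and the ≥5% acceptance gate sits outside this snippet.

```python
from dataclasses import dataclass, field

HYPOTHESES = ["X_BOUND", "R_BOUND", "M_BOUND", "WARP_BOUND", "NEAR_OPT"]

# Illustrative: the sign of the latency change each hypothesis predicts when a
# field is increased (+1 = predicts a speedup, -1 = a slowdown, 0 = no effect).
EXPECTED_SIGN = {
    ("X_BOUND", "XBLOCK"): +1,
    ("R_BOUND", "R0_BLOCK"): +1,
    ("WARP_BOUND", "num_warps"): +1,
    ("M_BOUND", "num_stages"): +1,
    ("NEAR_OPT", "XBLOCK"): 0,
}

@dataclass
class KernelState:
    """Per-kernel typed-hypothesis state, updated from (before, after, field) diffs."""
    weights: dict = field(default_factory=lambda: {h: 1.0 for h in HYPOTHESES})

    def update(self, changed_field: str, effect: float, noise: float = 0.02):
        # effect = fractional latency reduction after increasing changed_field.
        # Likelihoods compose multiplicatively across measurements.
        for h in self.weights:
            pred = EXPECTED_SIGN.get((h, changed_field), 0)
            agrees = (pred == 0 and abs(effect) < noise) or (pred * effect > noise)
            self.weights[h] *= 2.0 if agrees else 0.5

    def next_experiment(self, fields):
        # Probe the field on which the two most-likely hypotheses disagree most,
        # rather than looking up a static transitions[winner] edge list.
        top = sorted(self.weights, key=self.weights.get, reverse=True)[:2]
        return max(fields, key=lambda f: abs(EXPECTED_SIGN.get((top[0], f), 0)
                                             - EXPECTED_SIGN.get((top[1], f), 0)))
```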

What the measurements said

Per-op microbenchmark on 142 paired cells across 5 torchvision models, 3 modes:

| mode | bench ratio vs coord_descent | quality | setup_s | Pareto-equivalent-or-better |
|---|---|---|---|---|
| v2 | 1.00 | 1.00 | 0.83 | 76% |
| v3d (best variant) | 1.00 | 1.00 | 0.85 | 83% |

v3d is v3 with NEAR_OPT excluded from the keep-probing gate, a one-line fix reached after iterating through v3a-v3c; it added 7 percentage points of Pareto coverage over the original v3. It found above-noise per-op wins v2 never did: max_pool_backward 1.60×, sum.dim_IntList 1.17×, addmm 1.12×. The per-kernel state machinery picks different configs than coord_descent on those three ops, and the picks are faster.

End-to-end on the same 5 models, full forward pass with torch.compile and 30 steady-state iters:

| mode | geomean ratio vs default | wins | loses |
|---|---|---|---|
| max_autotune | 1.00 | squeezenet (-12%) | resnet18 (+21%), resnet50 (+17%) |
| coord_descent | 1.03 | mobilenet_v2 (-13%) | resnet50 (+16%) |
| abduction (v2) | 1.11 | mobilenet_v2 (-10%) | resnet18 (+54%), resnet50 (+27%) |
| coord_descent_threshold5pct | 0.97 | 4/5 models | resnet50 (+6%) |

The per-op view misled. v2 looked tied with coord_descent per-op yet ran the worst end-to-end by a clear margin. Whatever the per-kernel wins are, they don’t compose into whole-model latency.

Then variance ate everything. A 3-rep canary on resnet18 the next day showed v2 at -3% (faster) and threshold5pct at +38% (slower): same algorithms, same model, ±40-57 percentage points across runs. The “+3% threshold5pct geomean win” sat inside measurement noise. Single-run end-to-end claims smaller than ±50% are unresolvable at this measurement budget.
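For context, the numbers above come from roughly this measurement recipe, written here as a repeated canary so run-to-run spread is visible; the model and rep counts are illustrative, the CUDA-event timing pattern is standard:

```python
import statistics
import torch
import torchvision

def e2e_ms(model, x, iters=30):
    """Median steady-state latency of one compiled forward pass, in milliseconds."""
    compiled = torch.compile(model)
    for _ in range(5):                        # warmup: compilation and autotuning happen here
        compiled(x)
    torch.cuda.synchronize()
    times = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        compiled(x)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))
    return statistics.median(times)

model = torchvision.models.resnet18().cuda().eval()
x = torch.randn(32, 3, 224, 224, device="cuda")
with torch.no_grad():
    reps = [e2e_ms(model, x) for _ in range(3)]   # the 3-rep canary
print(reps, "spread:", max(reps) / min(reps))     # spread dwarfs the mode-to-mode deltas
```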

The seam doesn’t open

Three structural reasons the methodology doesn’t transfer the way it worked in tinygrad:

cuBLAS escapes the autotune surface. Default dispatches aten.mm through cuBLAS; max_autotune turns that off and forces Triton enumeration. Autotune doesn’t fail to find a better config — it actively degrades performance by trading the cuBLAS path for Triton-codegen. Every autotune mode’s five worst losses are matmul cells where Triton runs 2.4-4.6× slower than cuBLAS (alexnet/mm: default 22.5µs, autotune 54-74µs). No methodology over the Triton config space catches cuBLAS on canonical shapes.
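The comparison is straightforward to reproduce with nothing beyond compile modes; the shape below is illustrative, not the exact alexnet cell:

```python
import torch

def avg_ms(fn, *args, iters=100):
    fn(*args)                                 # first call triggers compile + autotune
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

a = torch.randn(32, 9216, device="cuda", dtype=torch.float16)    # illustrative FC-layer shape
b = torch.randn(9216, 4096, device="cuda", dtype=torch.float16)

mm_default = torch.compile(torch.mm)                             # default: aten.mm -> cuBLAS
mm_tuned = torch.compile(torch.mm, mode="max-autotune")          # autotunes Triton GEMM templates

print("default      ", avg_ms(mm_default, a, b), "ms")
print("max-autotune ", avg_ms(mm_tuned, a, b), "ms")
```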

The lever set is missing GEMM tricks. Even excluding cuBLAS, Triton’s tunable_fields = {XBLOCK, YBLOCK, ZBLOCK, R0_BLOCK, R1_BLOCK, BLOCK_M, BLOCK_N, BLOCK_K, num_warps, num_stages} omits split-K reductions, Hopper TMA, async copy pipelining, swizzle patterns. The lever isn’t there to pull.
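The GEMM half of that set is what a plain `triton.Config` can express; a hypothetical candidate list over those fields looks like this, and nothing in it can name a split-K schedule, a TMA load, or a swizzled layout:

```python
import triton

# Illustrative GEMM candidates over the tunable fields listed above. Every lever
# is a tile size, a warp count, or a pipeline depth; the vendor-library tricks
# (split-K, TMA, swizzle patterns) have no knob here at all.
gemm_configs = [
    triton.Config({"BLOCK_M": 64,  "BLOCK_N": 64,  "BLOCK_K": 32}, num_warps=4, num_stages=2),
    triton.Config({"BLOCK_M": 128, "BLOCK_N": 64,  "BLOCK_K": 32}, num_warps=4, num_stages=3),
    triton.Config({"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 32}, num_warps=8, num_stages=3),
    triton.Config({"BLOCK_M": 64,  "BLOCK_N": 128, "BLOCK_K": 64}, num_warps=8, num_stages=4),
]
```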

Measurement noise at 3-100µs kernel sizes overfits config selection. The harder the autotuner searches, the more it fits to noise: net (wins − losses) vs default goes -10pp for max_autotune, -15pp for coord_descent, -19pp for v2. Each “winning” config swap looks better at 5 repeats and is worse in truth.
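The winner's-curse mechanics are easy to simulate: give every candidate the same true latency, add measurement noise, select on the apparent minimum, and the search reports a gain that does not exist (a self-contained toy, not the post's data):

```python
import random
import statistics

random.seed(0)
TRUE_US, NOISE_US, REPEATS, CANDIDATES = 50.0, 3.0, 5, 20

def measure():
    # every config has identical true latency; only the noise differs
    return random.gauss(TRUE_US, NOISE_US)

selected = []
for _ in range(1000):
    # benchmark each candidate with REPEATS samples, keep the apparent fastest
    per_candidate = [min(measure() for _ in range(REPEATS)) for _ in range(CANDIDATES)]
    selected.append(min(per_candidate))

print("true latency        :", TRUE_US)
print("apparent best (mean):", round(statistics.mean(selected), 1))
# The selected 'winner' benchmarks several microseconds under its true latency;
# the more candidates the tuner searches, the larger the optimistic bias.
```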

The deepest cut: default’s variance runs 25-50× tighter than any autotune mode. Production cares about p99 tail, not just median. Default’s deterministic 1251µs (0.7% spread) Pareto-dominates threshold5pct’s 1280µs median (16.8% spread, worst-rep 1391µs).

What survives

Methodology contributions

From the wreckage:

Provenance

The PyTorch fork has the engine, the harness (benchmarks/abduction/), and the hypothesis graph with pre-committed predictions and falsifiers. Raw CSVs committed alongside.

The code is AGPL-3.0. The tinygrad experiments are public. This prose is CC BY-SA.


The investigation methodology is described in Investigate; the underlying primitive in Abduction; the prerequisites for accepting evidence in Before You Compose.