Abduction Tuner: a null result
Part of the methodology series. Sequel to Investigate.
I tried to abduce-transplant a hypothesis graph from speedygrad to PyTorch Inductor. Spent a day thrashing a 4080. No end-to-end improvement above measurement noise. The headline is the negative finding: for off-the-shelf NVIDIA hardware running off-the-shelf models, don’t autotune; use Inductor’s default mode. Default dispatches matmul through cuBLAS and conv through cuDNN, and no Triton-codegen autotune beats vendor-tuned kernels on canonical shapes.
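In torch.compile terms, the takeaway fits in two lines (model and shapes are whatever you are already running):

```python
import torch
import torchvision

model = torchvision.models.resnet18().cuda().eval()

# Default mode keeps the cuBLAS/cuDNN dispatch that won every comparison here.
fast = torch.compile(model)
# "max-autotune" trades that path for Triton enumeration, the losing trade below.
slow = torch.compile(model, mode="max-autotune")
```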
The full evidence is in the PyTorch fork's HYPOTHESIS_GRAPH.md; the summary is below.
Generate and test
Both tinygrad and PyTorch’s Inductor compile ML operations into GPU kernels through the same mechanism: heuristics produce candidate configurations, a benchmarking harness times each one, the fastest wins.
In tinygrad, the linearizer defines a combinatorial space of axis splits, upcasts, and local sizes; BEAM search explores it at 200 candidates per kernel on real hardware. In Inductor, triton_heuristics.py generates 15-20 configs, the harness benchmarks all and picks the min, then coordinate_descent_tuning fine-tunes. Same interface; one searches a large space, the other a small one with hand-tuned starts.
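Stripped of both codebases' details, the shared loop reduces to a sketch like this; `run_kernel`, `candidates`, and the timing discipline are stand-ins, not either project's API:

```python
import statistics
import time

def autotune(run_kernel, candidates, reps=5):
    """Generate-and-test: time every candidate config on real hardware,
    keep the fastest. BEAM and Inductor differ only in how `candidates`
    is produced and how many there are."""
    best_cfg, best_t = None, float("inf")
    for cfg in candidates:
        samples = []
        for _ in range(reps):
            t0 = time.perf_counter()
            run_kernel(cfg)                    # compile + launch with this config
            samples.append(time.perf_counter() - t0)
        t = statistics.median(samples)         # median damps launch jitter
        if t < best_t:
            best_cfg, best_t = cfg, t
    return best_cfg
```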
In speedygrad, the abduction engine works: 52 trials vs BEAM’s 193 actions, 1.85x geometric mean speedup over the heuristic on 4/5 workloads. That motivated the transplant.
What I built
Three iterations of the engine, ported as a fourth Inductor autotune mode alongside default / max_autotune / coord_descent.
v1 — collect-all + static transitions. Coordinate descent with a hand-authored transition graph: BLOCK_M → {BLOCK_K, num_warps, BLOCK_N, num_stages} etc. If BLOCK_M improved, follow its edges instead of round-robin. The graph encodes hardware structure (tile size determines occupancy determines warp count) and stays small (10 parameters, ~30 edges).
v2 — accept-immediately + same static transitions. Take the first improvement instead of evaluating all single-parameter moves. Cheaper per round, same graph.
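A sketch of the v1/v2 skeleton: the field names match Triton's tunables, but the edge set and the `perturb` moves are illustrative, not the fork's exact graph.

```python
# Hand-authored hardware-structure edges: if this field just won,
# probe its neighbors next instead of round-robin. ~30 edges in the real graph.
TRANSITIONS = {
    "BLOCK_M": ["BLOCK_K", "num_warps", "BLOCK_N", "num_stages"],
    "num_warps": ["num_stages", "BLOCK_M"],
}

def perturb(cfg, field):
    """Illustrative single-field moves: halve and double the value."""
    out = []
    for v in (cfg[field] * 2, max(1, cfg[field] // 2)):
        if v != cfg[field]:
            out.append({**cfg, field: v})
    return out

def descend(cfg, bench, accept_immediately=False):
    best = bench(cfg)
    frontier = list(cfg)                       # start round-robin over all fields
    while frontier:
        field = frontier.pop(0)
        winner, win_t = None, best
        for cand in perturb(cfg, field):
            t = bench(cand)
            if t < win_t:
                winner, win_t = cand, t
                if accept_immediately:         # v2: take the first improvement
                    break
        if winner is not None:                 # v1: best of the collected moves
            cfg, best = winner, win_t
            # static transitions: follow the winning field's edges next
            frontier = TRANSITIONS.get(field, []) + frontier
    return cfg
```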
v3 — composed evidence + per-kernel hypothesis state. What the source theory actually requires. Per-kernel typed hypotheses (X_BOUND, R_BOUND, M_BOUND, WARP_BOUND, NEAR_OPT, etc.). Each measurement is a (before, after, field-changed) diff. Likelihoods update multiplicatively; the next experiment is whichever perturbation maximally discriminates between top live hypotheses, not a static transitions[winner] lookup. Acceptance requires composed signal above the detection cliff (≥5% effect after multiple samples), not first improvement above noise.
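The v3 loop, reduced to its shape: the hypothesis names are the engine's, but the likelihood table and discrimination score below are illustrative stand-ins.

```python
import math

# Which single-field perturbation each typed hypothesis predicts will move
# runtime. Toy mapping (NEAR_OPT and others omitted); the fork's likelihood
# tables are richer.
PREDICTS = {
    "X_BOUND": "XBLOCK",
    "R_BOUND": "R0_BLOCK",
    "M_BOUND": "BLOCK_M",
    "WARP_BOUND": "num_warps",
}

def likelihood(hyp, field, saw_effect):
    """P(observed diff | hypothesis): reward hypotheses whose prediction
    about this field matches what the measurement showed."""
    return 0.9 if (PREDICTS[hyp] == field) == saw_effect else 0.1

class KernelState:
    """Per-kernel hypothesis state, one instance per compiled kernel."""

    def __init__(self):
        self.weight = {h: 1.0 for h in PREDICTS}

    def update(self, field_changed, saw_effect):
        # Each (before, after, field-changed) diff composes multiplicatively.
        for h in self.weight:
            self.weight[h] *= likelihood(h, field_changed, saw_effect)

    def next_experiment(self, fields):
        # Probe whichever field maximally discriminates the top two live
        # hypotheses, not a static transitions[winner] lookup.
        a, b = sorted(self.weight, key=self.weight.get, reverse=True)[:2]
        return max(fields, key=lambda f: abs(
            math.log(likelihood(a, f, True) / likelihood(b, f, True))))
```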
What the measurements said
Per-op microbenchmark on 142 paired cells across 5 torchvision models, 3 modes:
| mode | bench ratio vs coord_descent | quality | setup_s | Pareto-equivalent-or-better |
|---|---|---|---|---|
| v2 | 1.00 | 1.00 | 0.83 | 76% |
| v3d (best variant) | 1.00 | 1.00 | 0.85 | 83% |
v3d is v3 with NEAR_OPT excluded from the keep-probing gate — a one-line fix that added 7pp Pareto over the original v3, after iterating v3a-v3c. It found above-noise per-op wins v2 never did: max_pool_backward 1.60×, sum.dim_IntList 1.17×, addmm 1.12×. The per-kernel state machinery picks different configs than coord_descent on those three ops, and the picks are faster.
End-to-end on the same 5 models, full forward pass with torch.compile and 30 steady-state iters:
| mode | geomean ratio vs default | wins | losses |
|---|---|---|---|
| max_autotune | 1.00 | squeezenet (-12%) | resnet18 (+21%), resnet50 (+17%) |
| coord_descent | 1.03 | mobilenet_v2 (-13%) | resnet50 (+16%) |
| abduction (v2) | 1.11 | mobilenet_v2 (-10%) | resnet18 (+54%), resnet50 (+27%) |
| coord_descent_threshold5pct | 0.97 | 4/5 models | resnet50 (+6%) |
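The measurement discipline behind that table, roughly (a sketch; the committed harness lives in benchmarks/abduction/):

```python
import time

import torch

def steady_state_ms(model, x, iters=30, warmup=10):
    """Median latency over 30 steady-state iterations, after torch.compile
    warmup. Sketch of the measurement shape, not the fork's exact harness."""
    compiled = torch.compile(model)
    with torch.no_grad():
        for _ in range(warmup):                # absorb compile + autotune cost
            compiled(x)
        torch.cuda.synchronize()
        times = []
        for _ in range(iters):
            t0 = time.perf_counter()
            compiled(x)
            torch.cuda.synchronize()           # time the GPU, not the queue
            times.append((time.perf_counter() - t0) * 1e3)
    return sorted(times)[len(times) // 2]
```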
The per-op view misled: v2 looked tied with coord_descent per-op, yet was the worst mode end-to-end by a clear margin. Whatever the per-kernel wins are, they don't compose into whole-model latency.
Then variance ate everything. A 3-rep canary on resnet18 the next day showed v2 at -3% (faster) and threshold5pct at +38% (slower): same algorithms, same model, swings of 40-57 percentage points across runs. The "+3% threshold5pct geomean win" sat inside measurement noise. Single-run e2e claims under ±50% are unreadable at this budget.
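The fix the canary forces: multiple reps per mode and a conservative separation test. A minimal version of the IQR-overlap rule the methodology list below calls for (assumed form, not the harness's exact code):

```python
import statistics

def iqr(samples):
    """25th/75th percentile endpoints of a rep distribution."""
    q1, _, q3 = statistics.quantiles(samples, n=4)
    return q1, q3

def distinguishable(a, b):
    """Treat two modes as different only if their interquartile ranges
    don't overlap. Deliberately conservative at 3 reps."""
    a_lo, a_hi = iqr(a)
    b_lo, b_hi = iqr(b)
    return a_hi < b_lo or b_hi < a_lo
```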
The seam doesn’t open
Three structural reasons the methodology can’t transfer the way it did in tinygrad:
cuBLAS escapes the autotune surface. Default dispatches aten.mm through cuBLAS; max_autotune turns that off and forces Triton enumeration. Autotune doesn’t fail to find a better config — it actively degrades performance by trading the cuBLAS path for Triton-codegen. Every autotune mode’s five worst losses are matmul cells where Triton runs 2.4-4.6× slower than cuBLAS (alexnet/mm: default 22.5µs, autotune 54-74µs). No methodology over the Triton config space catches cuBLAS on canonical shapes.
The lever set is missing GEMM tricks. Even excluding cuBLAS, Triton’s tunable_fields = {XBLOCK, YBLOCK, ZBLOCK, R0_BLOCK, R1_BLOCK, BLOCK_M, BLOCK_N, BLOCK_K, num_warps, num_stages} omits split-K reductions, Hopper TMA, async copy pipelining, swizzle patterns. The lever isn’t there to pull.
At 3-100µs kernel runtimes, config selection overfits to measurement noise. The harder the autotuner searches, the more it fits noise: net (wins − losses) vs default goes -10pp for max_autotune, -15pp for coord_descent, -19pp for v2. Each "winning" config swap looks better over 5 repeats and turns out worse in truth.
The deepest cut: default’s variance runs 25-50× tighter than any autotune mode. Production cares about p99 tail, not just median. Default’s deterministic 1251µs (0.7% spread) Pareto-dominates threshold5pct’s 1280µs median (16.8% spread, worst-rep 1391µs).
What survives
- The tinygrad result stands. speedygrad’s 1.85× over its own heuristic is real; tinygrad has no cuBLAS to fall back on.
- One small PR pitch survives the variance, at the chosen-kernel level: raise `coordinate_descent_tuner.has_improvement`'s threshold from `0.001` to `0.05`. ~4% median chosen-kernel speedup at matched compile cost (apples-to-apples microbench). The e2e signal from the same change is inside the variance band. One-line change, marginal even in the best case (sketched after this list).
- v3 isn't dead, just undecided. v3d's microbench is materially better than v2's. The e2e canary was bad but inside the variance band. Deciding it would take per-config repeat budgets so high the GPU bill exceeds the methodology's possible upside here.
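The shape of that one-line change, assuming the threshold lives where the pitch points; this is a sketch, not a verified diff against the upstream file:

```python
# torch/_inductor/coordinate_descent_tuner.py (sketch of the pitched change)
def has_improvement(self, baseline, test):
    threshold = 0.05   # was 0.001: demand a 5% win before accepting a swap
    return test is not None and test < baseline * (1 - threshold)
```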
Methodology contributions
From the wreckage:
- Triton autotune comparison at single-run granularity is unreliable; need 3+ reps with IQR-overlap testing.
- Per-op microbenchmarks misled here: v2 tied coord_descent per-op and was the worst e2e mode by 8pp geomean.
- Cache isolation requires per-mode `TORCHINDUCTOR_CACHE_DIR`, not just subprocess-per-mode. Disk cache survives process death and short-circuits later autotune runs to `bench_calls=0` (sketched after this list).
- "Best baseline" comparison must include the strongest mode (cuBLAS via default), not just the closest peer. We caught ourselves comparing v2 against `coord_descent` while `default` was already winning both.
- Identity probes pay for themselves on first run. Day one was lost to Triton having no Windows + Python 3.14 wheels: Inductor degraded silently to ATen, all autotune modes produced identical numbers, and the harness reported clean CSVs. WSL install, retry, find the real signal.
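The cache-isolation point, as a sketch; `bench_mode.py` is a hypothetical per-mode entry point:

```python
import os
import subprocess
import tempfile

def run_isolated(mode):
    """Give each autotune mode its own Inductor disk cache. Subprocess-per-mode
    alone is not enough: the cache outlives the process and short-circuits the
    next run's autotuning to bench_calls=0."""
    env = dict(os.environ)
    env["TORCHINDUCTOR_CACHE_DIR"] = tempfile.mkdtemp(prefix=f"inductor_{mode}_")
    subprocess.run(["python", "bench_mode.py", "--mode", mode], env=env, check=True)
```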
Provenance
The PyTorch fork has the engine, the harness (benchmarks/abduction/), and the hypothesis graph with pre-committed predictions and falsifiers. Raw CSVs committed alongside.
The code is AGPL-3.0. The tinygrad experiments are public. This prose is CC BY-SA.
The investigation methodology is described in Investigate; the underlying primitive in Abduction; the prerequisites for accepting evidence in Before You Compose.