Encoding Expertise
Sequel to Make No Mistakes, Memory Compression, and Remediation.
The notification fires. A maintainer left a review on a PR you opened three days ago. Your terminal already shows thirteen other PRs in flight: four touched in the last hour, two with CI half-green, one with a question you haven’t answered, one merged you haven’t closed the loop on. Each owes a terminal state. Opening a PR is a promise. The maintainer’s attention and the issue slot are both non-renewable, and a PR that hangs after review wastes both. Slop wastes a reader who hadn’t engaged. A hanging post-review PR wastes attention already spent, and it squats the slot so nobody else picks up the issue. You consumed two scarce resources to deposit nothing.
You can’t keep routing fourteen of these by hand. Dropping any is the one move you can’t make.
State the problem cleanly. Precondition: a structured event lands in the actor’s inbox carrying the upstream contract’s shape (a PR identifier, the triggering signal, the message ledger). Postcondition: the actor emits exactly one bucket label from a fixed finite partition (qa, respond, comment-issue, investigate, human), routes the event to the matching outbox, and acks. Invariant: every accepted event reaches a terminal bucket; none may hang. Budget: low latency, high volume, finite rate-limit share. That is the classification problem. Everything that follows is what those constraints cost when the worker is a language model.
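The contract above can be sketched as types. This is a minimal illustration, not the actor's actual code: the names `Event`, `Actor`, `handle`, and the field names are assumptions introduced here; only the five bucket labels come from the text.

```python
from dataclasses import dataclass, field
from enum import Enum


class Bucket(str, Enum):
    """The fixed finite partition named in the postcondition."""
    QA = "qa"
    RESPOND = "respond"
    COMMENT_ISSUE = "comment-issue"
    INVESTIGATE = "investigate"
    HUMAN = "human"


@dataclass(frozen=True)
class Event:
    """Hypothetical shape of the upstream contract."""
    pr_id: str                # PR identifier
    signal: str               # the triggering signal
    ledger: tuple             # the message ledger


@dataclass
class Actor:
    """Hypothetical actor: one outbox per bucket, plus an ack log."""
    outboxes: dict = field(default_factory=lambda: {b: [] for b in Bucket})
    acked: list = field(default_factory=list)

    def handle(self, event: Event, classify) -> Bucket:
        # Invariant: every accepted event reaches a terminal bucket.
        # Any classifier failure or out-of-partition label lands in HUMAN
        # instead of hanging.
        try:
            bucket = Bucket(classify(event))
        except Exception:
            bucket = Bucket.HUMAN
        self.outboxes[bucket].append(event)   # route to the matching outbox
        self.acked.append(event.pr_id)        # ack
        return bucket
```

The `classify` argument stands in for whatever produces the label; the point of the sketch is that `handle` has no code path that returns without routing and acking.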
Act I — the failure
The first move is a skill. The classic Memory Compression one: repeated episodes crystallize into a function the substrate can call. Write a prompt: “Here is the PR state; return one of: qa, respond, comment-issue, investigate, human.” Wire it to the notification stream. Done.
It isn’t done. The bare LLM call drifts. It paraphrases the bucket name and ships a sixth one nobody can route. It reads a notification for an issue (not a PR) and confidently returns “qa.” It misses a verified fix and routes to the human inbox, where the operator sees the obvious shipped signal and curses. Twice in a row it returns the same confidently wrong bucket. It generates false-knowns: confident outputs whose underlying assumption is wrong, where the wrongness doesn’t surface until something downstream trips. The naive compression failed because all the expertise got pushed into one fuzzy nucleus that can’t hold it.
The second move is the wrapper from Make No Mistakes. Errors become loud, local, recoverable. A schema gates the skill’s output; around it the actor wraps throttling, jidoka, supervisor restart, leak counters. The boundary is observable. The wrapper catches the wrong answers, but it doesn’t reduce them. Run a hundred classifications, observe thirty errors, route those thirty to the inbox. Tomorrow run another hundred and get the same thirty errors of the same shape. The wrapper keeps producing inbox cards for the same situations, forever. The operator’s attention burns. Nothing converges.
What’s missing is an accounting rule. The substrate has finite human attention. Every inbox card and every andon-cord moment is a withdrawal from a non-renewable account. The principle is one line: human attention must produce a durable change. If a card just gets acked, or an andon cleared, without moving the substrate forward, that attention was wasted and the same case will surface again. The frontier of substrate knowledge advances toward known-known when each withdrawal moves a case across it; without that discipline the substrate accumulates the same cases on repeat.
This is older than LLMs. Engineers call it remediation. Response puts out the fire. Recovery rolls back, patches, redeploys. Remediation builds the remedy into the environment so the same fire can’t start the same way again. Teams that stop at recovery watch the outage recur next quarter under a different symptom. The wrapper without remediation is recovery on a treadmill: every catch a patch, every patch fades, the same shape reappears. The change has to land structurally, in code the next call doesn’t have to re-derive.
Act II — the mechanism
The core move: move every stable distinction out of the prompt. Each observed mistake teaches the substrate one new structural artifact. Preconditions reject malformed input at the door; postconditions catch malformed output before it ships, with schema enums closing the legal output space; CLI tools hand the model deterministic ground truth instead of letting it guess; caches make repeats free. Four deterministic strata, each cheaper than a model call, each compounding, plus the model itself as the fifth. The skill gets shorter as it gets smarter, because each extracted layer is one the model no longer has to reconstruct.
The construction is an expert system in the strict sense, with the LLM occupying the residual cell where the rule-based core used to bottom out. MYCIN and CLIPS ran the same shape decades ago with a thin uncertain core; the modern LLM-wrapper tooling (function calling, Outlines, LMQL, Guardrails AI, DSPy Assertions) ships the surface validators and constrained decoders. The title’s expertise is the expert-systems term of art. What none of the prior work pushes as the load-bearing claim is the asymmetry that follows from doing this continuously: over time the deterministic shell should absorb every crisp regularity, until the model only handles the irreducibly fuzzy. The tooling is half the story; the asymptote is the other half.
Walk the classifier. Sweep’s remit actor receives a “PR X changed state” card from the notification poller, fetches live PR state, and routes to one of qa, respond, comment-issue, investigate, or the human inbox. The operator used to do this by hand, dozens of times a day. Walk the encoding through each stratum, cheapest first.
Identity (precondition). Is this even classifiable? Half the cards remit pulls turn out to be issues, not PRs. GitHub’s “Could not resolve to a PullRequest” error surfaces this immediately. Encode it as a precondition: a card whose number resolves to an issue gets a skip → not_a_pr event and never sees the classifier. Zero tokens, zero latency, zero rate-limit share. Origin: an andon fired downstream when qa tried to check out a branch that didn’t exist because the “PR” was an issue. The false-known got named (“we assumed all notifications are PRs”); the assumption became one branch in code.
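A minimal sketch of that one branch, under assumptions: the card is a dict with `repo` and `number` keys, and `is_pull_request` is a hypothetical lookup injected by the caller (for example, something that asks GitHub and treats the "Could not resolve to a PullRequest" error as False). Neither name comes from the original system.

```python
from typing import Callable, Optional


def identity_precondition(
    card: dict,
    is_pull_request: Callable[[str, int], bool],
) -> Optional[dict]:
    """Return a skip event when the card's number is not a PR, else None.

    A None result means the card may continue toward the classifier.
    """
    if not is_pull_request(card["repo"], card["number"]):
        # The card never reaches the model: zero tokens spent.
        return {
            "event": "skip",
            "reason": "not_a_pr",
            "repo": card["repo"],
            "number": card["number"],
        }
    return None
```

Injecting the lookup keeps the branch itself free of network code, so it can be exercised with a fake in tests.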
Legal moves (postcondition). What buckets exist? Five. Not six, not “qa-but-fast,” not “kinda-shipped-maybe.” The postcondition is a hard enum; anything outside it routes to the human inbox with reject_reason: unrecognized_bucket. Schema as poka-yoke. Origin: an inbox card with bucket “shipped-with-caveats” surfaced once; the operator named the assumption (“model treats the enum as advisory”); the postcondition became the wall. Cost: one literal in code.
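The wall can be as small as a set-membership check. The sketch below assumes the model's answer arrives as a plain string; the function name `postcondition` and the `raw` field are illustrative, while the bucket names and `reject_reason: unrecognized_bucket` come from the text.

```python
# The closed output space: five labels, nothing else.
LEGAL_BUCKETS = frozenset(
    {"qa", "respond", "comment-issue", "investigate", "human"}
)


def postcondition(raw_bucket: str) -> dict:
    """Pass a legal bucket through; route anything else to the human inbox."""
    if raw_bucket in LEGAL_BUCKETS:
        return {"bucket": raw_bucket}
    # Keep the rejected label so the operator can see what the model invented.
    return {
        "bucket": "human",
        "reject_reason": "unrecognized_bucket",
        "raw": raw_bucket,
    }
```

Note that the check is exact: a paraphrase such as "QA" or "qa-but-fast" is rejected rather than normalized, which is the design choice the paragraph describes.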
Live state (CLI tool). Is CI passing? Who commented last? Is the branch on remote? The model used to derive these from prose context, badly. Encode each as a CLI tool the actor calls before the model fires. gh pr view returns the live state in about fifty milliseconds; the model receives the structured answer, not a fragment to reason over. Every CLI tool retires a class of false-known the model could otherwise have generated. Origin: many andons. “Model said shipped, branch wasn’t there” → tool. “Model said no engagement, last comment was eight hours ago” → tool. “Model said CI passing, the rollup said failing on the very check that triggered the notification” → tool, plus a trigger-override rule for the race window where the live fetch hasn’t caught up to the upstream signal.
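One way such a tool might look, as a sketch under assumptions. It shells out to `gh pr view --json` and reduces the answer to a few facts. The wrapper names (`run_gh`, `fetch_live_state`) and the output keys are hypothetical, and the exact shape of the `statusCheckRollup` and `comments` entries is assumed here rather than taken from the original system. The trigger-override rule is not modeled.

```python
import json
import subprocess
from typing import Callable, Sequence

# JSON fields requested from `gh pr view`.
FIELDS = "state,headRefName,statusCheckRollup,comments"


def run_gh(args: Sequence[str]) -> str:
    """Run the GitHub CLI and return its stdout (requires `gh` on PATH)."""
    result = subprocess.run(
        ["gh", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def fetch_live_state(
    repo: str,
    number: int,
    run: Callable[[Sequence[str]], str] = run_gh,
) -> dict:
    """Reduce live PR state to structured facts the model does not have to guess."""
    raw = json.loads(
        run(["pr", "view", str(number), "--repo", repo, "--json", FIELDS])
    )
    checks = raw.get("statusCheckRollup") or []
    comments = raw.get("comments") or []
    return {
        "state": raw.get("state"),
        "branch": raw.get("headRefName"),
        # Assumed entry shape: a failing check reports FAILURE under either
        # "conclusion" (check runs) or "state" (status contexts).
        "ci_failing": any(
            c.get("conclusion") == "FAILURE" or c.get("state") == "FAILURE"
            for c in checks
        ),
        "last_commenter": comments[-1]["author"]["login"] if comments else None,
    }
```

The `run` parameter exists so the parsing can be tested with canned JSON instead of a live repository.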
Repeats (cache). Key the routing decision by (repo, pr, state-hash). Same state, same bucket, free. The cache means the LLM only fires when the PR actually moved; a notification storm (one PR, ten reviewers, ten notifications inside a minute) collapses to one classification. Origin: the operator noticed the LLM had routed the same unchanged PR three times in a row with two different buckets, the rawest form of “the answer isn’t a function of the input.” An inbox card surfaced it. The cache made the answer a function of the input again.
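A minimal in-memory sketch of the (repo, pr, state-hash) key. The class and function names are illustrative, and a persistent version (the table's "JSON file read") would swap the dict for a file; hashing a canonical JSON dump is one assumed way to get a stable state hash.

```python
import hashlib
import json


def state_hash(state: dict) -> str:
    """Hash a canonical JSON form so key order cannot change the digest."""
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class RoutingCache:
    """Memoize routing decisions by (repo, pr, state-hash)."""

    def __init__(self):
        self._store = {}
        self.misses = 0  # number of times the classifier actually ran

    def route(self, repo: str, pr: int, state: dict, classify) -> str:
        key = (repo, pr, state_hash(state))
        if key not in self._store:
            self.misses += 1
            self._store[key] = classify(state)
        return self._store[key]
```

With this in place, ten notifications for one unchanged PR produce one classifier call, and the bucket for a given state can no longer vary between calls.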
Residue (LLM nucleus). What’s left is the irreducibly fuzzy. Is this maintainer hostile or just terse? Of two PRs that fix the same issue, is ours cleaner? Should this artifact’s recommendation be trusted given the operator’s prior calibration? These have no deterministic ground truth; they’re vibes-comparisons. The LLM earns its keep here, on a much smaller question than “what do I do with this PR.” The prompt is short. The model’s surface area is small. The residue is judgment, not re-derivation.
That is one skill, encoded. Five layers, four moved out of the prompt into substrate that pays no model freight, one left as the LLM nucleus:
| Stratum | Encoded as | Cost per call | Retires |
|---|---|---|---|
| Identity | precondition at inbox | one code branch | misclassification of off-shape input |
| Legal moves | postcondition enum | one literal | invented buckets |
| Live state | CLI tool | ~50ms call | facts the model would otherwise guess |
| Repeats | input-hash cache | JSON file read | re-derivation on identical input |
| Residue | LLM nucleus | full model call | (nothing; this is what’s left) |
At runtime the strata form a cascade. An event descends through them cheapest-first, and most events exit before they ever reach the model.
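The cascade can be sketched as a single function with early exits. Everything here is illustrative: the callables are injected stand-ins for the strata, the runtime ordering (identity, live state, cache, model, postcondition) is an assumption about how the layers compose, and declining to cache rejected outputs is a choice made for this sketch.

```python
def cascade(card, *, is_pr, fetch_state, cache, classify, legal_buckets):
    """Descend through the strata; return (exit_point, result)."""
    # 1. Identity: off-shape input exits before anything else runs.
    if not is_pr(card):
        return ("skip", "not_a_pr")

    # 2. Live state: deterministic facts fetched once, up front.
    state = fetch_state(card)

    # 3. Repeats: identical state returns the stored bucket for free.
    key = (card["repo"], card["number"], repr(sorted(state.items())))
    if key in cache:
        return ("cached", cache[key])

    # 4. Residue: only now does the model fire.
    bucket = classify(state)

    # 5. Legal moves: an invented bucket goes to the human inbox, uncached.
    if bucket not in legal_buckets:
        return ("reject", "human")

    cache[key] = bucket
    return ("model", bucket)
```

Counting how often each exit point appears over a day of traffic gives a direct measure of how much work the shell has absorbed.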
The skill’s runtime has three outputs, not two. Certain (ship the bucket). Ambiguous (the model says it isn’t sure, or the postcondition catches a shape it can’t validate). False-known (the model was confident and wrong; surfaces downstream when an invariant trips). Two of these route to human attention: ambiguous to inbox, false-known to andon. Both feed the encoding loop from different sides. The ambiguous branch is what ML calls a classifier with reject option (Chow, 1970): when confidence falls below threshold, abstain instead of guess. The ambiguous-to-inbox route is the standard human-in-the-loop (HITL) escalation, and the supervisor’s selection of which patterns to ask the operator about is a form of active learning sample selection.
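A sketch of the three-way split, under assumptions: the model is taken to return a bucket plus a numeric confidence, the threshold of 0.8 is an arbitrary illustrative value, and the names `Decision`, `decide`, and `on_invariant_trip` are hypothetical. Only the certain/ambiguous/false-known split and the inbox/andon routing come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    route: str        # "outbox" (certain) or "inbox" (ambiguous)
    bucket: str
    reason: str = ""


def decide(bucket: str, confidence: float, legal_buckets, threshold: float = 0.8) -> Decision:
    """Classifier with reject option: abstain instead of guessing."""
    if bucket not in legal_buckets:
        # The postcondition caught a shape it cannot validate.
        return Decision("inbox", "human", "unrecognized_bucket")
    if confidence < threshold:
        # The model says it is not sure.
        return Decision("inbox", "human", "below_threshold")
    return Decision("outbox", bucket)


def on_invariant_trip(decision: Decision, broken_assumption: str) -> dict:
    """False-known: a certain decision later proven wrong goes to the andon."""
    return {
        "channel": "andon",
        "bucket": decision.bucket,
        "assumption": broken_assumption,
    }
```

The false-known case cannot be detected inside `decide`, which is why it has its own entry point: it only becomes visible when a downstream invariant fails.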
Inbox cards teach the skill what it doesn’t know. The operator disambiguates; the substrate gains a precondition, a tighter enum, or a new cache key. Andons teach the skill what it thought it knew and didn’t. The operator names the broken assumption; the substrate gains a CLI tool, a postcondition, or a leak counter. Without the andon channel the encoded shell only grows from the humble side; the cocky side keeps shipping wrong buckets confidently and the false-known state accumulates silently. The wrapper from Make No Mistakes provides the andon channel; the human-attention principle ensures both channels’ withdrawals produce structural artifacts; encoding is where those artifacts land.
The inner loop is a three-output function whose two non-certain outputs feed an encoder that modifies the function itself. The deterministic shell grows monotonically toward the problem’s shape; the LLM nucleus shrinks toward the residue. Run the encoding procedure over the same skill twice and the second pass changes nothing; the shell only changes again when a new shape of mistake arrives. Run a hundred events of one input shape through the encoded skill and almost none reach the LLM after the first few; run the same hundred through the bare call you started with and count the drift.
That is the unit. One classifier, one operator, one encoded shell that grows from observed mistakes. But the operator is still doing the encoding work by hand. Every inbox card disambiguated, every andon named, is a small decision about how to encode the mistake. After enough cards, those decisions themselves form a pattern. The next hoist is the human: replace the operator with a supervisor that watches both attention channels and ships encoding updates automatically. That’s the next post.