Make No Mistakes

Sequel to Skills Lack Determinism and Don’t LLM What You Can Code.

Agents fuck it up all the time. They:

You can’t stop them fucking up. The worker is a language model, and the output distribution shifts under cost, prompt drift, and weather.

The field has tried. Most attempts to fix this work on the model, making the agent itself more reliable. Here’s the rough inventory of what’s been tried and where each falls short:

Every attempt above tries to make the model output more trustworthy. More training, more critics, more sampling, more reasoning. The implicit theory: an unreliable worker can be made reliable from the inside.

Step back. Are agents different from humans, reliability-wise? Humans hallucinate (we call it misremembering). Humans capitulate under social pressure. Humans skip steps and claim they didn’t. Humans write final-FINAL-v2.docx. Humans take destructive actions on routine-looking prompts. The error spectrum is shockingly similar; the rates differ on which axis, but neither party is reliable on its own.

So how do we ever have factories that ship near-defect-free products daily? Not by hiring more careful humans. Not by training the workers to want what management wants. By building a process (kanban, andon, jidoka, poka-yoke, standardized work, supplier separation) that makes individual unreliability irrelevant at the aggregate. The worker on a Toyota line in 1985 was no more virtuous than one in a 1970 Detroit plant. The output was orders of magnitude better because the process caught what the worker missed.

Whose advice should we take on reliability, then? Code artisans whose method is “be more careful” and vibe coders whose method is “ship it, see what breaks”? Or the lineage that solved this exact problem at scale, in physical reality, with workers who had hands and got tired and made mistakes and were paid not to be replaced by ones who didn’t?

You can’t make the model trustworthy. You can make its boundary observable. Build a wrapper that makes every fuck-up loud, local, and recoverable. The fuck-ups don’t go away; they stop cascading.

This post is nine transplantable patterns for that wrapper. Written first for agents that will drop these into another codebase: each trick is stated as invariant + mechanism + failure-prevented, no language, no library. Written second for curious humans: the patterns aren’t new, just newly applied. Their lineage is two of the deepest answers anyone has given to how do you build reliable systems out of unreliable parts, and they were sitting there waiting for the moment the unreliable part became a language model.

A skill is a prompt that shells out to a model. A skill actor is the durable worker that wraps one. It owns observability, throttling, rejection routing, and recovery. The skill writes prose. The actor keeps the line running.

The same answer shows up in the most reliable software we have. Ericsson’s AXD301 telecom switch hit nine 9s of uptime (about 31 milliseconds of downtime per year) running Erlang/OTP. WhatsApp served 450 million users with around 50 engineers and ran 2 million TCP connections per server on the same stack; that’s what Facebook bought in 2014. Discord moves billions of messages a day on Elixir, which is Erlang’s runtime with nicer syntax. Klarna, Goldman Sachs’s pricing systems, Bet365’s live trading: same lineage. None of these are research projects. They are the production systems other production systems envy.

What’s under the hood is the actor model: encapsulated processes, message-passing mailboxes, let it crash, and supervision trees where the supervisor never does the work it supervises. Joe Armstrong’s PhD thesis was titled Making reliable distributed systems in the presence of software errors. The premise was that errors are not exceptional; they are continuous, and the architecture must assume them. Same premise as Toyota’s. Different industry, same answer.

Two lineages converge here. Hewitt’s actor model (1973) and its industrial-strength descendant Erlang/OTP. Toyota’s Production System (1948 onward) and its global descendants. Invented forty years apart by people who never read each other, applied to wildly different substrates (humans on a line, processes on a VM), and they converge on the same primitives:

Toyota Production SystemActor model / Erlang-OTPShared principle
Kanban pull signalMessage send to mailboxDemand-driven flow, no upstream push
WorkstationActor / processEncapsulated state, no shared memory
Andon cordSupervisor-detected crashAnyone can stop the line; failure is loud
Jidoka (autonomation)“Let it crash” + supervisor restartDefects halt the unit, not the system
Poka-yokePattern-matched message contractWrong-shape input can’t enter the next stage
Genchi genbutsuPer-process logs, no shared mutable stateGo to the actual site to see ground truth
Heijunka (leveling)Per-actor quota / budget shareSmooth load across roles, no starvation
Reject bin (defective vs wrong-part){error, Reason} tupleFailures carry their cause forward
Standardized workgen_server / gen_statem behavioursCommon skeleton, local variation
Supplier separationProcess isolationOne station’s failure doesn’t corrupt the next

Both answer the same question: how do you build a reliable system out of unreliable parts that must keep working while parts fail? Skill actors add a third unreliable part, a language model whose output distribution shifts under cost, prompt drift, and weather.

The math textbook is the third witness. A pipeline is a composition of skills; composition is only well-defined when each skill is a monoid endomorphism on the shared cache: associative under chaining, with an identity (the no-op skill), and idempotent enough that skill(skill(x)) == skill(x) after a small bounded number of passes. Without that fixed-point contract, composition becomes a drift operator: every additional skill in the chain moves the artifact further from where any single skill would have left it, and the pipeline’s behavior is no longer a function of its parts. The category-theoretic name is a reflective subcategory of converged artifacts; the practical name is the pipeline doesn’t diverge when you add a step. TPS calls this standardized work. OTP calls it the gen_server contract. Mathematicians wrote it down a century earlier and called it associativity. Three independent derivations of the same primitive.

The nine tricks below are what an actor must provide so the monoidal contract on its skill holds under an unreliable model.

1. Artifact-first, stdout-shim-second, heuristic-third

Invariant: the skill’s decision lives in a written artifact, not in stdout. Mechanism: skill writes a file (attestation, decision JSON, hypothesis graph). Actor reads the file. On missing file, fall back to a cheaper model that extracts the schema from stdout. On that failing, fall back to a regex on the tail. Prevents: model paraphrase, prepended preambles, and cost-driven truncation from being misread as a decision. Artifact existence is binary; stdout always parses into something.

2. Subprocess blind spots: the PATH shim

Invariant: every external API call made by anything the actor spawns is attributed to the actor. Mechanism: prepend a shim directory to the spawned process’s PATH. Inject the actor’s identity into its environment. The shim, named identically to the real tool, records the call against the actor and execs the real binary. Prevents: subprocesses-of-subprocesses (typical when a skill calls another CLI that calls APIs) vanishing from rate-limit accounting. Counts at the receiving dock; the supplier’s claim doesn’t count.

3. Per-actor budget shares via ambient identity

Invariant: each actor consumes only its declared share of a shared rate limit. Mechanism: each actor declares a percentage of the global limit. Current actor identity travels through the call stack as an ambient context value (contextvar, AsyncLocalStorage, process dictionary, whatever the runtime offers), not as a threaded argument. The PATH shim from trick 2 and every wrapped client read this ambient. Prevents: whoever-runs-first-eats-most under shared limits. Attribution without plumbing.

4. Three-tier jidoka with hysteresis

Invariant: the line throttles before it stops, and recovers below where it stopped. Mechanism: three thresholds on a single utilization signal: throttle (slow intake), andon (pause the line), recover (clear the andon). recover < andon by a deliberate dead band. Prevents: flapping at the boundary. Without hysteresis, automatic recovery oscillates around the trigger.

5. The supervisor paradox

Invariant: the loop that can wedge cannot be the loop that unwedges it. Mechanism: the auto-clear runs on an independent daemon. It forces a fresh read of the gating signal, re-evaluates, and clears the andon when the signal allows. Prevents: a worker stuck inside a long call from firing its own recovery. The worker’s failure mode is precisely what disables its recovery. First instinct is always to put the recovery check at the top of the same loop. Resist.

6. Inbox-boundary pause, not mid-flight kill

Invariant: pausing the line stops new pulls; in-flight work completes. Mechanism: the pause check fires before the actor pulls the next message from its inbox. Skills that started, finish. New messages stay queued. Prevents: half-written artifacts, unacked messages, ambiguous external state. Stop issuing kanban cards; don’t shoot the operator mid-motion. Pause is not kill.

7. Rejection as a third outcome

Invariant: decided | errored | rejected are distinct. None coerces into another. Mechanism: a skill returns one of three. rejected carries a reject_reason (“wrong-shape input,” “missing context,” “out of scope”). Rejections route to a separate inbox for human review; they are neither acked-and-forgotten nor counted against error budgets. Prevents: rejected jobs poisoning either the decision stats (counted as “no”) or the alerting (counted as errors). Defective-part bin and wrong-part bin sit on different shelves for a reason.

8. Interface accounting: every drop has a reason

Invariant: for every interface boundary, in == out + screened + pending. Leak is the gap; target is zero. Mechanism: instrument each boundary with five counters: in, out, screened (filtered with a recorded reason), pending (still in an inbox), leak (the residual). Alert on nonzero leak. Prevents: silent message loss. Slow processing is not loss. Reasoned filtering is not loss. Unexplained disappearance is, and an unexplained gap is the only alert worth firing on the funnel.

9. Compliance counters that close the shim loop

Invariant: the inspection step does not silence upstream pressure. Mechanism: the shim from trick 1 emits three counters per skill: clean_fast (parsed without help), normalized (shim had to extract), fallback (heuristic kicked in). Surface them as percentages with sample size on the same dashboard as the skill’s other metrics. Prevents: the shim hiding skill drift. Without the counters, normalized output masks producer-side regressions; with them, growing normalized flags drift and growing fallback flags shim failure. The shim hides the bug; the counters surface the hiding. Both are necessary.


Stack the nine and you have a skill actor: a durable worker that observes its own boundary, throttles before it stops, recovers from outside itself, distinguishes rejected from errored, accounts for every message, and surfaces its own drift. That object is the prerequisite. The actor side underwrites the monoidal contract on the skill side: artifact-first (1) is what makes a skill’s output replayable, rejection-as-third-outcome (7) is what keeps the identity element pure, interface accounting (8) is what makes associativity checkable. Without the actor, the contract is a wish. With it, the contract is enforced.

Once the unit holds, composition follows. Two skill actors chained share a mailbox; their joint behavior is a skill actor. The fixed-point property carries through composition because each component had it. Pipelines are monoids of skill actors. Larger systems are monoids of pipelines. The whole stack is associative all the way up, which is why you can rearrange stages without rewriting the runtime, and why the same supervisor tree works at every scale.

And here is the prestige. The list of things people ship as their next AI-in-production initiative (observability, cost attribution, rate-limit governance, drift detection, audit trails, replayability, blast-radius containment, graceful degradation, backpressure, idempotency, fault isolation, recovery without lost work) is already present in the composed object. Not as add-ons. As consequences of the contract. The artifact-first invariant is the audit trail. The PATH shim is cost attribution. Budget shares are rate-limit governance. Three-tier jidoka is backpressure and graceful degradation. The supervisor paradox is fault isolation. Inbox-boundary pause is recovery without lost work. Rejection-as-third-outcome is drift detection at the input. Interface accounting is observability. Compliance counters are drift detection at the output. Every item on the production wishlist is already paid for at the unit; composition inherits them all because monoids preserve their generators’ properties.

You don’t bolt production-readiness onto a pipeline of skill actors. You get it the moment the unit is correct. That is the difference between an engineering primitive and a feature checklist, and it’s why three lineages that never spoke converged on the same one.

The composition doesn’t stop there. The monoidal contract is scale-free: a skill actor is a monoid, a pipeline of skill actors is a monoid, a fleet of pipelines is a monoid, an organization of fleets is a monoid. The same associativity, the same identity, the same fixed-point under repeated application, at every level. There is no scale at which the primitive stops working and a new architecture has to take over. Toyota proves this in physical reality: one workstation, one cell, one line, one plant, one global production system, all running the same kanban-andon-jidoka discipline that holds on a single station. OTP proves it in software: one process, one supervision tree, one node, one cluster, all running the same let-it-crash discipline. The skill actor inherits the same property because the contract is the same. Whatever you compose, you can compose again. The ceiling is set by the substrate, not the primitive.

Nothing here is invented. The Toyota Production System, Erlang/OTP, and basic algebra each derived the same primitive from a different starting point, separated by decades and disciplines, and they agree because the problem is the same: build a reliable composition out of unreliable parts. The worker now happens to talk. That’s the only update. Take their advice over vibes. Get the unit right and the pipeline falls out.