Generalize or Specialize? Retaining Reusable Skills for World-Model Agents

Position / survey draft. Agents that act in a world increasingly write their own skills, growing a library of reusable abstractions and planning over it. The hard part is not proposing a skill but deciding which to retain, a choice between generalization and specialization. Two research communities that rarely cite each other have each formalized one side under another name: compression (minimum description length) is generalization, keeping what recurs; planning utility (Minton’s macro utility problem) is specialization, keeping the rare skill that pays off. Both are the same operation, cache eviction, one of the oldest problems in systems, where a bounded cache keeps either what is used most or what is dearest to recompute. Placed on one footing, the two criteria agree wherever reuse tracks search value, the regime standard Blocksworld occupies, and part only on rare-but-critical skills, the specialists compression cannot see. We map where they coincide and the corner where they diverge, and conjecture that long-horizon world-model agents, whose ever-growing libraries are never weighed against their carrying cost, inhabit that corner, where keeping only the general half is keeping the wrong half.

1. Introduction

Agents now write their own skills. Given an environment, a language model will propose, code, and store reusable behaviors, growing a skill library it plans over; Voyager is the canonical case. This was read as a glimpse of recursive self-improvement: an agent that accumulates its own capabilities should compound them. It has not proven so simple. A library that only grows is not a library that improves. The binding cost is not retrieval time, which a good index keeps cheap; Soar’s production match is near-logarithmic in library size. The binding cost is interference and capacity: a crowded library dilutes its own retrieval, the right skill harder to surface among near-misses, and for a language-model agent the skill index alone eats the context window in proportion to the library, even under progressive disclosure that loads bodies only on demand. Indexing makes lookup fast; it does not make a crowded library legible, nor shrink a manifest that grows against a fixed window. Were that cost zero, keeping everything would win and there would be nothing to decide; it is not, and a skill’s carrying cost falls on every query while its payback falls only on those that use it. Unbounded accumulation is therefore self-defeating, and the rule that decides which skills to evict is the retention criterion. The hard part was never proposing a skill, which a language model does fluently, but deciding which to retain. This retention criterion is the organizing question here.

The choice it forces is between generalization and specialization, and it is older and wider than language-model agents: HTN learners, program-induction systems, and hierarchical reinforcement learners all grow and prune the same kind of library under other names. Two criteria recur across these literatures, which rarely cite each other, and they are the two sides of that choice: compression (minimum description length) is generalization, keeping what recurs, the broadly reusable skill; and planning utility (Minton’s macro utility problem) is specialization, keeping the rare specialist that cracks one hard case and is invisible to frequency. The contribution is the map of where they agree and where they part, and the split is, mechanically, cache eviction: what a library keeps versus what it evicts.

This map reads five lineages through one lens, the propose-then-keep decomposition. The five are macro-operator and explanation-based learning, grammar induction, HTN-method learning, hierarchical-RL option discovery, and program-library and LLM skill-library learning.

Contributions. (1) a common notation for abstraction-library learning over a world model; (2) a taxonomy comparing HTN learning, grammar induction, HRL option discovery, and program-library learning by their proposal mechanism and selection pressure; (3) a controlled experiment that reconciles two literatures at the level of the computation, mapping where the compression and utility retention criteria empirically coincide and where they diverge; (4) the gaps the map reveals and an open problem: whether long-horizon world-model agents operate in the divergence corner, where their ever-growing skill libraries would need a utility retention pressure rather than compression.

2. Problem Formulation: Abstraction as Library Learning

Let D = {τ₁…τ_N} be a corpus of successful behavior traces, each τ = (s₀, a₀, s₁, …, s_T) generated under a world model M : S × A → S over primitive actions A. By world model we mean any predictive model that supports counterfactual evaluation of action sequences: symbolic transition rules, a learned neural simulator, a DSL interpreter, or an LLM-mediated state predictor.

A candidate abstraction c has an interface (a task head, option, program type, or macro name), an applicability condition pre(c), an expansion body(c) in primitives or other abstractions, and a carrying cost κ(c) for storing, matching, and maintaining it. A library L is a set of abstractions; given L, each trace rewrites to a derivation z ∈ Rewrite(τ; L) that reconstructs τ under M. Write the rewritten corpus Z_L(D).

Three objective families recur for choosing L:

Compression / MDL:   L* = argmin_L  [ bits(L) + bits(Z_L(D)) ]
Bayesian library:    L* = argmax_L  [ log p(D | L, M) + log p(L) ]      with  p(L) ∝ e^(−λ·K(L))
Use-time utility:    L* = argmax_L  [ E_q benefit(q; L, M) − cost(L) ]

They are three ways of pricing an abstraction against its carrying cost: representational length, posterior probability, or expected use-time value. But only the utility criterion prices use-time value directly; compression and the Bayesian prior price encoding length and bet that it tracks value. The library is useful only insofar as it changes planning over M.
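The compression objective can be made concrete with a toy sketch. Everything here is an illustrative assumption rather than any cited system's encoding: the `Abstraction` class, the greedy longest-match `rewrite`, and a uniform code over symbols are hypothetical simplifications, and applicability conditions and carrying cost κ are left out.

```python
from dataclasses import dataclass
from math import log2
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Abstraction:
    name: str                 # interface (macro name)
    body: Tuple[str, ...]     # expansion in primitive actions


def rewrite(trace: Sequence[str], library: Sequence[Abstraction]) -> List[str]:
    """Greedy left-to-right rewrite: replace the longest matching body by its name."""
    by_length = sorted(library, key=lambda c: len(c.body), reverse=True)
    out, i = [], 0
    while i < len(trace):
        for c in by_length:
            n = len(c.body)
            if n > 0 and tuple(trace[i:i + n]) == c.body:
                out.append(c.name)
                i += n
                break
        else:
            # No abstraction matches here: emit the primitive unchanged.
            out.append(trace[i])
            i += 1
    return out


def description_length(corpus, library, n_primitives: int) -> float:
    """bits(L) + bits(rewritten corpus) under a uniform code (an assumption)."""
    # Each rewritten symbol is a primitive or a library name.
    bits_per_symbol = log2(n_primitives + len(library))
    # Each body is spelled in primitives plus one end-of-body marker.
    library_bits = sum((len(c.body) + 1) * log2(n_primitives + 1) for c in library)
    corpus_bits = sum(len(rewrite(t, library)) * bits_per_symbol for t in corpus)
    return library_bits + corpus_bits
```

Under this toy code, an abstraction whose body recurs across the corpus lowers the total, while one that never matches only adds its own encoding and widens the alphabet, so a compression rule would discard it regardless of how much search it might save.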

3. The Mechanism: Retention as Cache Eviction

The mechanism already has a name in a field that solved this problem decades before any of the lineages here and is cited by none of them. A library read far more often than it is written, with a budget on what it can hold, is a cache, and deciding what to retain is cache eviction. The two retention criteria are the two classic eviction families. Compression is LFU: evict what is accessed least, keep what recurs. Utility is cost-aware eviction, Greedy-Dual-Size and its frequency-weighted form GDSF, where an item’s “cost” is its miss penalty, the work redone without it, which is exactly the search a missing abstraction forces. Belady’s optimal policy, evict what is needed farthest in the future, is the ideal both approximate. Cognitive science split the same way: human memory tracks frequency and recency as a rational estimate of future need (Anderson and Schooler), yet also preferentially retains high-value items independent of how often they recur (Castel and colleagues), the two eviction criteria and their divergence in one memory system.
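The two eviction families can be put side by side in a few lines. This is a minimal sketch, not a cache implementation: the `Skill` fields and function names are hypothetical, the GDSF priority is taken in its usual form (clock + frequency × cost / size), and the aging clock is held fixed, so only the static ranking is shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Skill:
    name: str
    uses: int             # access count (frequency)
    miss_penalty: float   # search work redone if the skill is absent
    size: float = 1.0     # carrying cost of holding it


def lfu_keep(skills: List[Skill], budget: int) -> List[Skill]:
    """Frequency-only retention: keep the most-used items."""
    return sorted(skills, key=lambda s: s.uses, reverse=True)[:budget]


def gdsf_keep(skills: List[Skill], budget: int, clock: float = 0.0) -> List[Skill]:
    """Cost-aware retention: rank by clock + frequency * cost / size."""
    def priority(s: Skill) -> float:
        return clock + s.uses * s.miss_penalty / s.size
    return sorted(skills, key=priority, reverse=True)[:budget]
```

With two common cheap skills and one rare skill whose absence forces a large search, the frequency rule keeps the two common ones, while the cost-aware rule keeps the rare one first.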

Caching has already produced this result. Cost-aware eviction beats pure frequency precisely when item costs are heavy-tailed, a few objects far dearer than the rest, and coincides with it when cost tracks frequency. That is the agreement-versus-divergence boundary the controlled comparison below makes precise, mapped in web proxies in the 1990s; caching learned the same lesson by negative example, as LFU prevailed until cost-aware policies displaced it on heterogeneous workloads. The abstraction-library setting adds one thing caching lacks: composition. Classical eviction (Belady, GDSF) prices independent items, but abstractions build on each other, so retention here is eviction over a dependency graph, where discarding one item changes the cost of the rest. This is the one point at which the analogy genuinely breaks, and it is open: the eviction literature has little to say about caches whose items compose. The retention criterion, and the condition under which its two forms part, carry over regardless.

4. Proposal Mechanisms: Generating Candidate Abstractions

The oldest proposal mechanism is goal regression. HTN-Maker (Hogg, Muñoz-Avila, and Kuter) regresses a goal backward through the suffix of a solved trace; the surviving conditions become a method’s precondition, the spanned actions its expansion. It required annotated tasks, and CURRICULAMA (Li, Nau, Roberts, and Fine-Morris, 2024) removed that by deriving the tasks as planning landmarks. Earlier HTN-by-observation systems make the proposal step explicitly observational: Nejati, Langley, and Könik’s, and CaMeL (Ilghami and colleagues, which learns method preconditions for a given structure).

A second mechanism is pattern mining: treat each trace as a symbol sequence and extract recurring substrings. Hérail and Bit-Monnot’s structure learner uses the GoKrimp algorithm (Lam and colleagues) to promote frequent patterns to synthetic tasks, consolidated in Hérail’s 2024 thesis. A third is program search: DreamCoder (Ellis and colleagues) searches a DSL for task solutions, then mines its own programs. A fourth, the crudest, is enumeration: Hérail and Bit-Monnot’s 2022 paper generated whole models by partitioning the action set.

None of this is new; it descends from macro-operator acquisition (Fikes, Hart, and Nilsson; STRIPS MACROPS), explanation-based learning (DeJong and Mooney; Minton’s PRODIGY), and Soar’s chunking of impasse resolution into rules (Laird, Rosenbloom, and Newell). The same shape recurs in grammar induction: SEQUITUR (Nevill-Manning and Witten) replaces repeated digrams with nonterminals, RePair (Larsson and Moffat) builds straight-line grammars by pair replacement, ADIOS (Solan and colleagues) induces significant patterns. The proposal step takes many forms but shares one tendency: it is generous, always offering more abstractions than are worth keeping.
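The pair-replacement idea behind RePair can be sketched briefly. This is a simplified illustration, not the published linear-time algorithm: it recounts pairs on every pass, overcounts overlapping occurrences in runs such as "aaa", and assumes no input symbol is already named like the generated rules (`R0`, `R1`, ...), which are hypothetical names.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def repair(seq: Sequence[str]) -> Tuple[List[str], Dict[str, Tuple[str, str]]]:
    """Repeatedly replace the most frequent adjacent pair with a fresh nonterminal."""
    seq = list(seq)
    rules: Dict[str, Tuple[str, str]] = {}
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, nothing left to abstract
        name = f"R{len(rules)}"
        rules[name] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(seq: Sequence[str], rules: Dict[str, Tuple[str, str]]) -> List[str]:
    """Invert the grammar: expand nonterminals back to primitives."""
    out: List[str] = []
    for sym in seq:
        if sym in rules:
            out.extend(expand(rules[sym], rules))
        else:
            out.append(sym)
    return out
```

The loop is proposal and retention fused: every repeated pair is proposed, and the only filter is that it occurs at least twice, which is the generous tendency the paragraph above describes.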

5. Selection Pressures: Which Abstractions to Retain

Two keeping-pressures recur, generalization and specialization. A substantial subset of systems keep by compression, the generalist’s criterion. DreamCoder retains a routine only when it lowers the joint description length of library and programs; Stitch (Bowers and colleagues) makes the same selection roughly an order of magnitude faster. Hérail and Bit-Monnot score whole HTN models by an explicit MDL metric and mine patterns by GoKrimp’s most-compressing-first rule. In grammar induction the grammar size is the objective.

A second subset keeps by use-time utility, the specialist’s. Minton’s work on the utility problem in PRODIGY kept a learned control rule only when its estimated search-time savings beat its matching cost, the first explicit statement that learned structure has a carrying cost.
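The accounting behind that rule is commonly summarized as average savings times application frequency, minus average match cost. A minimal sketch of that bookkeeping follows; the function and parameter names are illustrative, not PRODIGY's own interface, and the units are whatever the planner measures search in.

```python
def rule_utility(avg_savings: float, application_freq: float, avg_match_cost: float) -> float:
    """Net search saved per match attempt by one learned rule.

    avg_savings:      search saved on the occasions the rule applies
    application_freq: fraction of match attempts on which it applies
    avg_match_cost:   cost of testing its conditions, paid on every attempt
    """
    return avg_savings * application_freq - avg_match_cost


def retain(avg_savings: float, application_freq: float, avg_match_cost: float) -> bool:
    """Keep the rule only when it pays for its own matching."""
    return rule_utility(avg_savings, application_freq, avg_match_cost) > 0.0
```

A rule that fires on one attempt in a hundred but saves 1000 units each time survives a match cost of 2, while a rule that fires nine times in ten but saves only 3 units does not survive a match cost of 5. Frequency alone would rank them the other way.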

Much of hierarchical reinforcement learning uses neither. The options framework (Sutton, Precup, and Singh) defines temporally extended actions without a discovery rule; option-discovery methods then propose subgoals from bottlenecks (McGovern and Barto; Şimşek and Barto’s betweenness), spectra (Machado and colleagues’ eigenoptions), reachability (Konidaris and Barto’s skill chaining), diversity or empowerment (Eysenbach and colleagues’ DIAYN), or differentiable return (Bacon, Harb, and Precup’s Option-Critic). Only the description-length branch, PolicyBlocks (Pickett and Barto) and LOVE (Jiang and colleagues), matches the compression thesis, and only Jinnai and colleagues price planning time directly; the rest are genuine counterpoints. One caveat the table cannot fully capture: across much of this branch, proposal and selection collapse into a single objective. Options are induced and retained by the same discovery rule, with the count fixed in advance, so the “retention pressure” column here names the inducing objective rather than a separate retention step. The clean propose-then-keep split is itself more a property of the symbolic and program-library lineages than of the differentiable ones.

6. A Taxonomy of Abstraction Learners

The table is the map, and the choice of axes is the argument. The conventional cuts through this literature run by representation, symbolic versus neural, or by domain, planning versus reinforcement learning versus program synthesis; both keep the communities apart and hide the convergence. The claim here is that the load-bearing axis is neither, but the retention criterion. Along that axis, HTN learners, grammar inducers, and program-library systems share a cell, while methods that share a representation fall on opposite sides. The framing is the contestable part, and it is the contribution: an agent can list these systems, but deciding that the right cut is along retention pressure rather than representation is a claim that must be staked. Each row carries one proposal and one retention pressure; where a retention pressure is absent, the row records it.

| Lineage | System | World model | Proposal | Retention pressure | Type |
|---|---|---|---|---|---|
| Macro / EBL | MACROPS (Fikes+ 1972) | symbolic | generalize solved plan | none → utility problem | proposal-only |
| Macro / EBL | PRODIGY (Minton 1988) | symbolic | EBL on solved instances | search utility − match cost | utility-explicit |
| Macro / EBL | Soar chunking | symbolic | chunk impasse resolution | none (architectural) | accumulation |
| Macro / EBL | Soar forgetting (Derbinsky, Laird 2013) | symbolic | chunking | base-level activation + reconstruction cost | compression + utility |
| Grammar | SEQUITUR | sequence | repeated-digram replacement | grammar size (2 constraints) | compression-explicit |
| Grammar | RePair | sequence | most-frequent-pair replacement | grammar size | compression-explicit |
| Grammar | ADIOS | sequence | significant-pattern detection | statistical significance | other-objective |
| Grammar | GoKrimp (Lam+) | sequence DB | candidate patterns | most-compressing (MDL) | compression-explicit |
| HTN | HTN-Maker | symbolic | goal regression (annotated) | subsumption / redundancy | proposal + syntactic prune |
| HTN | CaMeL (Ilghami+) | symbolic | precondition learning | n/a (structure given) | proposal-only |
| HTN | by observation (Nejati+ 2006) | symbolic | observe executions | none explicit | proposal-only |
| HTN | CURRICULAMA (2024) | symbolic | regression + landmarks | none new (unbounded growth) | accumulation |
| HTN | enumeration (Hérail+ 2022) | symbolic | partition enumeration | whole-model MDL | compression-explicit |
| HTN | structure learner (Hérail+ 2023) | symbolic | GoKrimp + regression | MDL (per-pattern + model) | compression-explicit |
| HRL | options (Sutton+ 1999) | MDP | given / defined | n/a (framework) | framework |
| HRL | PolicyBlocks (Pickett+ 2002) | MDP | shared policy fragments | description length | compression-like |
| HRL | betweenness (Şimşek+ 2009) | MDP graph | centrality subgoals | graph centrality | other-objective |
| HRL | eigenoptions (Machado+ 2017) | MDP | Laplacian eigenvectors | spectral | other-objective |
| HRL | Option-Critic (Bacon+ 2017) | learned | differentiable options | policy-gradient return | other-objective |
| HRL | DIAYN (Eysenbach+ 2018) | learned | diversity skills | mutual information | other-objective |
| HRL | LOVE (Jiang+ 2022) | learned | variational segmentation | info cost on skills | compression-like |
| HRL | min-time options (Jinnai+ 2019) | MDP | option-set search | planning-time reduction | utility-explicit |
| Program | DreamCoder / EC (Ellis+) | DSL | program search | Bayesian / MDL library | compression-explicit |
| Program | Stitch (Bowers+ 2023) | DSL | corpus-guided abstraction | description length | compression-explicit |
| Program | BPL (Lake+ 2015) | generative program | hierarchical parts | Bayesian prior | Bayesian-compression-like |
| LLM agent | Voyager (Wang+ 2023) | LLM / sim | LLM skills from feedback | none (ever-growing) | accumulation-without-pruning |
| LLM agent | DEPS (Wang+ 2023) | LLM / sim | LLM plan decomposition | none | accumulation |
| LLM agent | Reflexion (Shinn+ 2023) | LLM | stored verbal reflection | none (append) | accumulation-without-pruning |
| LLM agent | ExpeL (Zhao+ 2024) | LLM | extracted insights | weak heuristic | weak-keep |

The pattern is not “everyone compresses.” Compression (or a Bayesian prior that behaves like it) dominates grammar induction, program induction, and the recent HTN line; explicit utility appears in EBL; HRL is split, with most option-discovery using other objectives entirely; and the LLM/world-model agents mostly have no retention pressure. The honest claim is the disjunction: durable libraries need some retention pressure, and these are the recurring forms.

7. A Controlled Comparison: When Generalization Suffices, and When Specialization Pays

The compression and utility branches are separate literatures. On a controlled domain they are directly comparable, and that comparison is the object of this section. Each candidate skill abstracts one task segment with two independent properties: a frequency f, the fraction of tasks needing it, and a hardness h, the segment’s blind-search cost (B^h model-rollouts if unabstracted). All segments share a description length, so an MDL retention rule’s gain is proportional to f alone; compression is blind to hardness by construction, while a utility retention rule scores f · B^h. This is a mechanism demonstration, not an effect-size estimate: a space of any significant size needs both kinds of skill at once, the frequent general abstractions compression keeps and the rare specialists only a utility rule sees, so a criterion that prices one alone leaves the other half uncovered. Equal description lengths make MDL exactly frequency-ranking, which isolates the mechanism; in practice a skill’s encoding length grows with its expansion, so real MDL is a noisy proxy for hardness rather than blind to it, and the correlation sweep below and Blocksworld restore the realistic case. Under a carrying-cost budget K (a larger library also raises the per-step matching floor B + |L|), MDL keeps the K most frequent skills, utility the K highest-f · B^h. Expected held-out planning cost is exact, with no Monte-Carlo noise.
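The two retention rules and an exact expected cost can be sketched as follows. The cost model is an assumption made for illustration, not the experiment's harness: a kept segment is charged one step at the matching floor B + |L|, an uncovered segment is charged its blind-search cost B^h, and segments are independent. It is not meant to reproduce the reported numbers.

```python
from typing import List, Set, Tuple

Skill = Tuple[float, float]  # (frequency f, hardness h)


def keep_mdl(skills: List[Skill], budget: int) -> Set[int]:
    """Equal description lengths: MDL gain is proportional to f, so rank by f."""
    order = sorted(range(len(skills)), key=lambda i: skills[i][0], reverse=True)
    return set(order[:budget])


def keep_utility(skills: List[Skill], budget: int, B: float) -> Set[int]:
    """Rank by expected search saved, f * B**h."""
    order = sorted(
        range(len(skills)),
        key=lambda i: skills[i][0] * B ** skills[i][1],
        reverse=True,
    )
    return set(order[:budget])


def expected_cost(skills: List[Skill], kept: Set[int], B: float) -> float:
    """Exact expectation over tasks: no sampling, so no Monte-Carlo noise."""
    floor = B + len(kept)  # assumed per-step matching floor, B + |L|
    return sum(
        f * (floor if i in kept else B ** h)
        for i, (f, h) in enumerate(skills)
    )
```

Under this assumed model, and with both rules holding the same number of skills, the utility rule can never cost more than the frequency rule: it maximizes the search removed, and the frequency rule maximizes the total frequency charged at the floor. How large the gap is depends entirely on how f and h are distributed.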

With B = 4, 30 candidate skills, and budget K = 10 (frequency and hardness uncorrelated):

| retention rule | library size | held-out planning cost |
|---|---|---|
| no-library | 0 | 10356 |
| accumulate-all | 30 | 265 |
| frequency / MDL keep | 10 | 8739 |
| utility keep | 10 | 831 |

Having no library is far worse than keeping everything, by a factor of roughly 39 (10356 versus 265). But a retention rule alone does not fix it: the frequency / MDL rule barely improves on no library (8739 versus 10356), spending the budget on frequent-but-easy skills and leaving the rare-hard segments uncovered. Only a utility rule recovers, keeping the abstractions that actually cut search: at 831 it sits roughly an order of magnitude below MDL and within about a factor of three of the unbudgeted accumulate-all baseline, at a third of its library size. The retention criterion, not the act of keeping, carries the result.

Whether that gap appears is conditional, and the condition is the contribution. Sweeping the correlation ρ between frequency and hardness against the budget K traces a phase boundary: utility’s advantage runs up to roughly 28× where hard skills are rare and uncorrelated with frequency, and collapses toward parity as ρ → 1, where the frequent skills are the hard ones and compression selects them anyway.
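One way such a sweep could be set up is sketched below. The sampling scheme is entirely an assumption: frequency is a logistic transform of one Gaussian, hardness is a clipped linear mix of that Gaussian with independent noise, and `h_mean`, `h_sd`, and the cost model (floor B + |L| for kept segments, B^h otherwise) are hypothetical choices. The sketch does not claim to reproduce the reported ratios.

```python
import random
from math import exp, sqrt
from typing import Callable, List, Set, Tuple

Skill = Tuple[float, float]  # (frequency f, hardness h)


def sample_population(n: int, rho: float, rng: random.Random,
                      h_mean: float = 3.0, h_sd: float = 1.5) -> List[Skill]:
    """Draw skills whose hardness correlates with frequency at level rho."""
    skills = []
    for _ in range(n):
        z_f, z_n = rng.gauss(0, 1), rng.gauss(0, 1)
        f = 1.0 / (1.0 + exp(-z_f))  # frequency in (0, 1), monotone in z_f
        h = max(0.0, h_mean + h_sd * (rho * z_f + sqrt(1 - rho * rho) * z_n))
        skills.append((f, h))
    return skills


def top_k(skills: List[Skill], k: int, key: Callable[[float, float], float]) -> Set[int]:
    order = sorted(range(len(skills)), key=lambda i: key(*skills[i]), reverse=True)
    return set(order[:k])


def expected_cost(skills: List[Skill], kept: Set[int], B: float) -> float:
    floor = B + len(kept)
    return sum(f * (floor if i in kept else B ** h) for i, (f, h) in enumerate(skills))


def advantage(rho: float, k: int, B: float = 4, n: int = 30,
              populations: int = 16, seed: int = 0) -> float:
    """Ratio of MDL cost to utility cost, pooled over sampled populations."""
    rng = random.Random(seed)
    mdl_total = util_total = 0.0
    for _ in range(populations):
        skills = sample_population(n, rho, rng)
        mdl_total += expected_cost(skills, top_k(skills, k, lambda f, h: f), B)
        util_total += expected_cost(skills, top_k(skills, k, lambda f, h: f * B ** h), B)
    return mdl_total / util_total
```

Calling `advantage(rho, k)` over a grid of correlations and budgets gives one cell per pair. At rho = 1 hardness is a monotone function of frequency, so both rules select the same skills and the ratio is 1; below that, the ratio depends on how heavy the hardness tail is under the assumed sampling.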

[Figure: heatmap of log10(MDL planning cost / utility planning cost) over frequency–hardness correlation (x-axis) and library budget (y-axis). Bright = utility beats compression. The advantage fills the rare-and-uncorrelated regime and vanishes on the right, where frequency tracks hardness and compression picks the hard skills for free. Exact expected cost, averaged over 16 skill populations per cell.]

So compression is a sound retention criterion exactly where statistical regularity tracks search value, and it fails precisely on the rare-but-critical abstraction it cannot see. Minton’s utility problem and the MDL criterion are one picture: two prices on the same carrying cost, agreeing when frequency and difficulty align and diverging when they do not.
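The agreement condition can be written out for the equal-description-length setup used in this section. The derivation below uses only that setup's symbols (f, h, B); the indices i and j and the shorthand u are introduced here for illustration.

```latex
% MDL ranks skill i by f_i; the utility rule ranks it by u_i = f_i B^{h_i}.
% A rarer skill j (f_j < f_i) outranks skill i under utility exactly when
\[
  u_j > u_i
  \;\iff\; f_j\, B^{h_j} > f_i\, B^{h_i}
  \;\iff\; h_j - h_i > \log_B \frac{f_i}{f_j}.
\]
% Agreement: if h is nondecreasing in f, then f_i > f_j implies u_i > u_j,
% so both rules produce the same ranking and keep the same K skills.
% Worked case: B = 4 and f_i / f_j = 64 give \log_4 64 = 3, so a skill used
% 64 times less often is preferred by utility once it is more than three
% hardness levels harder.
```

Read this way, the two rules part only where the hardness gap outruns the logarithm of the frequency gap, which is the heavy-tailed case.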

The same ablation in a standard planning domain confirms the boundary is real, not an artifact of the synthetic setup. In Blocksworld, with macro-operators mined from solved plans and planning cost measured as nodes expanded by greedy best-first search, the two retention rules nearly coincide and the gap widens over the tested sizes: utility beats MDL by 1.08× at five blocks, 1.18× at six, and 1.37× at seven, as deeper deadlocks make the rare search-saving macro matter more (mean expanded nodes with no library rise from 6.8 to 24.3 over the same range). Standard Blocksworld sits in the agreement corner, where frequency tracks search-value, and drifts toward divergence as the domain hardens. It is the kind of domain where a compression retention rule has served the symbolic lineages well. (Code, figure, and the Blocksworld harness: the mine-then-keep repository.)

A second confirmation comes from a deployed cognitive architecture, and it falls in the opposite corner. Derbinsky and Laird gave Soar’s learned rules a forgetting mechanism, since shipped in the architecture as production apoptosis: a rule is excised when its base-level activation drops below threshold and it can be reconstructed by re-derivation (a reinforcement-tuned rule, whose learned value cannot be regenerated, is spared). Those are the two criteria exactly, activation the frequency (compression) side and reconstruction cost the miss-penalty (utility) side. On Liar’s Dice, where rare reinforcement-tuned rules carry most of the value, forgetting by activation alone collapses competence from 75% of games won to below 55%; adding the reconstruction-cost criterion restores it at a fraction of the memory. That is the divergence corner in a real architecture, the synthetic result reproduced where the stakes are higher: frequency-only retention evicts the specialists, and pricing their recomputation brings them back. Standard Blocksworld sits in the agreement corner and Liar’s Dice in the divergence corner; together they bracket the phase boundary in deployed systems, not only the synthetic sweep.
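The two-part forgetting test described above can be sketched as follows. The activation formula is the standard base-level form, the log of summed power-law-decayed accesses; the decay of 0.5, the threshold, and the `Rule` fields are assumptions for illustration, not Soar's actual data structures or parameter values.

```python
from dataclasses import dataclass
from math import inf, log
from typing import List


@dataclass
class Rule:
    name: str
    access_times: List[float]   # times at which the rule fired
    reconstructible: bool       # False for reinforcement-tuned rules


def base_level_activation(access_times: List[float], now: float, decay: float = 0.5) -> float:
    """ln(sum over past accesses of (now - t) ** -decay); -inf if never accessed."""
    total = sum((now - t) ** -decay for t in access_times if t < now)
    return log(total) if total > 0 else -inf


def should_forget(rule: Rule, now: float, threshold: float, decay: float = 0.5) -> bool:
    """Excise only rules that are both stale and cheap to get back."""
    if not rule.reconstructible:
        return False  # learned value cannot be regenerated: spare it
    return base_level_activation(rule.access_times, now, decay) < threshold
```

Dropping the `reconstructible` check turns this into activation-only forgetting, which is the frequency-side rule on its own; the check is what prices the miss penalty.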

8. Open Problems and Future Directions

The map makes three absences visible. The first is an empty column: every LLM and world-model skill-library agent in the taxonomy pairs the strongest proposal mechanism with no retention criterion at all. Voyager’s library is ever-growing by design, Reflexion appends, and none weighs an abstraction’s reuse value against its carrying cost. The agents with the richest world models are those with no mechanism for discarding what they learn, and a comprehensive survey of world models (ACM Computing Surveys, 2025) already names the abstract, high-level action layer as an open problem in its own right. That layer is a learned abstraction library: the field built the proposal half and left selection to accretion, with a recent wave of evolving skill graphs and self-improving libraries only beginning to treat management as a problem.

The field has already learned this lesson once. The utility problem Minton named in 1988, that a planner hoarding learned macros can spend more time matching them than they save, became a central reason macro-operator and explanation-based systems had to control what they kept. Soar’s chunking could produce “expensive chunks” whose match cost degraded performance (Tambe, Newell, and Rosenbloom). And learned-method sets can still grow without obvious payoff: CURRICULAMA reports method count and planning time climbing in Blocks World, Logistics, and Rover. Each case is the same shape: unmanaged accumulation turning a learned library from an asset into a tax. The result survives in the field’s collective experience more than in any single paper’s abstract, and accounts for much of why the older symbolic lineages developed a retention criterion at all: Soar itself, whose expensive chunks first showed the tax, later gained a forgetting mechanism that keeps the frequently used and the costly-to-reconstruct and discards the rest (Derbinsky and Laird). The skill-library agents have rebuilt the accumulation without that brake, and are positioned to relearn the result at scale.

The second absence is an untested assumption, and clearing it requires adjudicating two tempting readings. The first reading is that compression, because it rewards what recurs, thereby rewards reuse. But whether learned library items are reused at all is not automatic: a recent evaluation finds reported library-learning gains that trace to self-correction rather than reuse. Statistical regularity is a proxy for value, and the proxy can fail on its own terms. The second reading takes the bottleneck-option literature as already showing that rare skills carry outsized worth. It does not: diverse-density and betweenness subgoals are selected for sitting central or common on successful paths, so the work establishes leverage, not a rare-use, high-value separation. Only the line that prices planning time directly (Jinnai and colleagues) speaks to the frequency-value gap at all. So using bottleneck-option results as evidence for a rare-but-critical tail is a category slip; that research solves a different problem well.

The third absence is a methodological blind spot: the retention criterion is seldom ablated. The controlled comparison above is one of few, and it runs on a synthetic domain and Blocksworld, not a live agent.

These converge on one open problem, and it turns on a contingent fact: whether the value in an open-ended agent’s skills concentrates in rare-but-critical abstractions, the heavy tail where compression and utility part. The structure is generic rather than exotic. A passport is used a few times a decade and is indispensable on each, and a rule that keeps what you use often deletes it first. Value and frequency are independent axes, a point already settled for experience, where prioritized replay keeps the rare important transitions over the common ones, and settled earlier still in cognitive science: complementary learning systems theory (McClelland, McNaughton, and O’Reilly) holds that no single learner both extracts shared structure and retains sparse specifics, which is why the brain pairs a generalizing neocortex with a specific-storing hippocampus, the learning systems Kumaran, Hassabis, and McClelland argue an intelligent agent needs. The general library and the rare specialist are that pairing under a planning budget; planning utility is the retention criterion that catches the specialist, the objective Jinnai and colleagues optimize directly. Whether a real agent’s library inherits this shape is unmeasured, so we leave it open. But the analogy fixes the prior: a planner whose hardest sub-problems are rare bottlenecks, the one deadlock-breaking maneuver, the unlock that opens a region, learns its highest-value skills exactly where compression is blind. If so, the agents accumulating libraries today are keeping the wrong half.

Three steps would close it: extend the controlled comparison to a more combinatorial domain such as Logistics, and to larger Blocksworld, to trace the full agreement-to-divergence drift; measure the value-versus-frequency distribution of a real skill-library agent’s abstractions directly; and ablate retention criteria in a live world-model agent over a long horizon. Our own grid-puzzle agent, a learned simulator with a decomposition library grown by proposal-then-compression, is the intended vehicle for the last.

9. Conclusion

Two communities that rarely cite each other have been computing nearly the same library: generalization and specialization, under the names compression and utility, coincide wherever reuse tracks difficulty, the regime standard Blocksworld occupies, where either rule keeps the same library and each field succeeds with its own. They part only on the rare-but-critical abstraction, the corner we conjecture the newest agents occupy and the one their accreting libraries cannot keep. This map is one lens among others, and a survey’s framing can ossify a field as easily as clarify it; the lens earns its place only if it makes the retention criterion visible as a choice rather than an accident, and names the single measurement that would settle which choice long-horizon agents need.

10. References