Embedding Pipe
Part of the cognition series. Builds on The Parts Bin.
The pipe
An embedding pipeline processes items through six stages. Each stage has a contract. The machine-readable catalog is in _data/parts-bin.yml under data_structure: embedding_space.
| Stage | What it does | Common implementations |
|---|---|---|
| Perceive | Produce vectors from raw input | CLIP, sentence transformers, contrastive learning |
| Cache | Maintain a searchable index over vectors | HNSW, IVF-PQ, ball tree |
| Filter | Retrieve a candidate set, strictly smaller than the index | c-ANN search, ε-approximate range search |
| Attend | Rerank and diversify under a budget | MMR, k-center / farthest-first traversal |
| Consolidate | Update the embedding model or retrieval policy from outcomes | Triplet-loss fine-tuning, online k-means, Growing Neural Gas |
| Remember | Persist artifacts across runs (index, model, metadata) | FAISS index serialization, product quantization, checkpoint save |
Cache builds and searches the live index. Remember persists it to disk so the next run doesn’t start from scratch. Cache is the in-memory structure. Remember is the durable snapshot.
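The Cache/Remember split can be made concrete with a minimal sketch. This substitutes a brute-force linear scan for HNSW (the contract is the same, only the speed differs), and pickle for FAISS-style index serialization; the class and method names are illustrative, not any library's API.

```python
import math
import pickle

class BruteForceIndex:
    """Stand-in for an HNSW index: the Cache stage's live, searchable structure."""

    def __init__(self):
        self.vectors = []  # list of (item_id, vector) pairs

    def add(self, item_id, vec):
        self.vectors.append((item_id, vec))

    def search(self, query, k=5):
        """Exact k-NN by cosine distance (HNSW answers the same query approximately)."""
        def cos_dist(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return 1.0 - dot / (na * nb)
        scored = sorted(self.vectors, key=lambda iv: cos_dist(query, iv[1]))
        return [item_id for item_id, _ in scored[:k]]

    # Remember: persist the live structure so the next run starts warm.
    def save(self, path):
        with open(path, "wb") as f:
            pickle.dump(self.vectors, f)

    def load(self, path):
        with open(path, "rb") as f:
            self.vectors = pickle.load(f)
```

Cache is everything above `save`; Remember is `save`/`load`. A fresh process calls `load` instead of re-embedding the corpus.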
Filter grid
The Filter stage in embedding space, arranged as selection semantics × error guarantee. The similarity row is the strongest; predicate and dominance are secondary; causal is an open research direction.
| | Exact | Bounded | Probabilistic |
|---|---|---|---|
| Similarity | Exact k-NN | c-ANN (cover tree, HNSW) | LSH ANN |
| Predicate | Metric range search | ε-approximate range search | LSH range query |
| Causal | Open: geometry-aware interference estimation (Leung 2022) and FDR-controlled causal selection (Duan et al. 2024) exist separately. No known composition for embedding-distance-defined interference with bounded FDR. | | |
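The difference between the Exact and Bounded columns of the similarity row is a contract, and the contract is checkable. A sketch, with illustrative function names:

```python
import math

def exact_nn(points, query):
    """Exact 1-NN: the Exact / Similarity cell. Linear scan, no error."""
    return min(points, key=lambda p: math.dist(query, p))

def satisfies_c_ann(points, query, candidate, c):
    """The Bounded / Similarity contract (c-ANN): the candidate's distance
    may exceed the true nearest distance, but by at most a factor of c."""
    true_d = math.dist(query, exact_nn(points, query))
    return math.dist(query, candidate) <= c * true_d
```

An index like HNSW or a cover tree earns its "bounded" label by guaranteeing (or empirically holding) `satisfies_c_ann` for some c; the LSH column replaces the guarantee with a success probability.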
Attend grid
The Attend stage in embedding space. The top-k slate row is where agents spend most of their time.
| | No diversity | Implicit | Explicit |
|---|---|---|---|
| Top-k slate | k-NN retrieval | MMR | k-center / farthest-first |
| Single best | 1-NN | Medoid | Farthest-point sampling |
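The Top-k slate / Implicit cell (MMR) fits in a few lines. A sketch, with `relevance` as a precomputed score map and `similarity` as any symmetric function; λ = 1 recovers plain top-k:

```python
def mmr(candidates, relevance, similarity, k, lam=0.7):
    """Maximal Marginal Relevance: greedily pick items that are
    relevant to the query but not redundant with picks so far."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(c):
            # Redundancy = similarity to the closest already-selected item.
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

The Explicit column (k-center / farthest-first) drops the relevance term entirely and maximizes the minimum pairwise distance instead.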
Example: article feed
A concrete embedding pipe for surfacing fresh articles from an RSS-like feed:
- Perceive: embed each new article with a sentence transformer.
- Cache: add to an HNSW index.
- Filter: for each candidate, compute distance to nearest existing article in the corpus. Reject if below a novelty threshold (density-based filtering).
- Attend: from survivors, pick top-k by relevance × diversity using MMR. The similarity penalty is cosine distance in the embedding space.
- Consolidate: track which articles the user reads. Fine-tune the embedding or adjust the novelty threshold.
- Remember: serialize the HNSW index and read-history to disk.
The Filter step inverts the usual ANN query: instead of finding items close to a query, it rejects items close to existing coverage. The Attend step is standard MMR. The Consolidate step closes the loop.
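The inverted query is easier to see in code than in prose. A sketch of the Filter step, using brute-force nearest-distance in place of an ANN index and Euclidean distance for simplicity:

```python
import math

def novelty_filter(candidates, corpus, threshold):
    """Density-based novelty filter: a candidate survives only if its
    nearest existing article is far enough away, i.e. it covers ground
    the corpus does not already cover."""
    survivors = []
    for cand in candidates:
        nearest = min((math.dist(cand, doc) for doc in corpus),
                      default=float("inf"))
        if nearest >= threshold:  # reject if too close to existing coverage
            survivors.append(cand)
    return survivors
```

Note the direction: a standard ANN query keeps the closest items; this keeps the farthest. The same index serves both, only the acceptance test flips.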
How embedding space differs from flat
Flat pipelines process records by attribute. Embedding pipelines process records by position in a learned space. The dominant primitives shift:
- Cache: key-based lookup → geometric index (HNSW, IVF)
- Filter: predicate scan → proximity retrieval (ANN)
- Attend: score-based ranking → coverage-aware diversification (MMR, k-center)
- Consolidate: parameter update → space reshaping (metric learning)
Embedding pipelines also use predicate filtering (metadata filters alongside ANN), but the geometric operations are the ones that distinguish the pipe.
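The coexistence of the two filter styles can be sketched in one function: the flat pipe's predicate scan survives as a metadata pre-filter, while the ranking itself is geometric. The item schema here is illustrative:

```python
import math

def filtered_ann(items, query, k, predicate):
    """Predicate filtering + proximity retrieval in one pass.
    `items` are dicts with an embedding under "vec" and attributes
    under "meta"; `predicate` is an attribute test, as in a flat pipe."""
    eligible = [it for it in items if predicate(it["meta"])]
    eligible.sort(key=lambda it: math.dist(query, it["vec"]))  # geometric rank
    return eligible[:k]
```

Production vector stores push the predicate into the index traversal rather than pre-filtering, but the composed contract is the same.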
Written via the double loop.