Embedding Pipe
Part of the cognition series. Builds on The Parts Bin.
The pipe
An embedding pipeline processes items through six stages. Each stage has a contract. The machine-readable catalog is in _data/parts-bin.yml under data_structure: embedding_space.
| Stage | What it does | Common implementations |
|---|---|---|
| Perceive | Produce vectors from raw input | CLIP, sentence transformers, contrastive learning |
| Cache | Maintain a searchable index over vectors | HNSW, IVF-PQ, ball tree |
| Filter | Retrieve a candidate set, strictly smaller than the index | c-ANN search, ε-approximate range search |
| Attend | Rerank and diversify under a budget | MMR, k-center / farthest-first traversal |
| Consolidate | Update the embedding model or retrieval policy from outcomes | Triplet-loss fine-tuning, online k-means, Growing Neural Gas |
| Remember | Persist artifacts across runs (index, model, metadata) | FAISS index serialization, product quantization, checkpoint save |
Cache builds and searches the live index. Remember persists it to disk so the next run doesn’t start from scratch. Cache is the in-memory structure. Remember is the durable snapshot.
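The Cache/Remember split can be made concrete with a minimal sketch. This substitutes a brute-force linear scan for HNSW (the contract is the same, only the speed differs), and pickle for FAISS-style index serialization; the class and method names are illustrative, not any library's API.

```python
import math
import pickle

class BruteForceIndex:
    """Stand-in for an HNSW index: the Cache stage's live, searchable structure."""

    def __init__(self):
        self.vectors = []  # list of (item_id, vector) pairs

    def add(self, item_id, vec):
        self.vectors.append((item_id, vec))

    def search(self, query, k=5):
        """Exact k-NN by cosine distance (HNSW answers the same query approximately)."""
        def cos_dist(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return 1.0 - dot / (na * nb)
        scored = sorted(self.vectors, key=lambda iv: cos_dist(query, iv[1]))
        return [item_id for item_id, _ in scored[:k]]

    # Remember: persist the live structure so the next run starts warm.
    def save(self, path):
        with open(path, "wb") as f:
            pickle.dump(self.vectors, f)

    def load(self, path):
        with open(path, "rb") as f:
            self.vectors = pickle.load(f)
```

Cache is everything above `save`; Remember is `save`/`load`. A fresh process calls `load` instead of re-embedding the corpus.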
Filter grid
The Filter stage in embedding space, arranged as selection semantics × error guarantee. The similarity row is the strongest; predicate and dominance are secondary; causal is an open research direction.
| | Exact | Bounded | Probabilistic |
|---|---|---|---|
| Similarity | Exact k-NN | c-ANN (cover tree, HNSW) | LSH ANN |
| Predicate | Metric range search | ε-approximate range search | LSH range query |
| Causal | Open: geometry-aware interference estimation (Leung 2022) and FDR-controlled causal selection (Duan et al. 2024) exist separately. No known composition for embedding-distance-defined interference with bounded FDR. | | |
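The difference between the Exact and Bounded columns of the similarity row is a contract, and the contract is checkable. A sketch, with illustrative function names:

```python
import math

def exact_nn(points, query):
    """Exact 1-NN: the Exact / Similarity cell. Linear scan, no error."""
    return min(points, key=lambda p: math.dist(query, p))

def satisfies_c_ann(points, query, candidate, c):
    """The Bounded / Similarity contract (c-ANN): the candidate's distance
    may exceed the true nearest distance, but by at most a factor of c."""
    true_d = math.dist(query, exact_nn(points, query))
    return math.dist(query, candidate) <= c * true_d
```

An index like HNSW or a cover tree earns its "bounded" label by guaranteeing (or empirically holding) `satisfies_c_ann` for some c; the LSH column replaces the guarantee with a success probability.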
Attend grid
The Attend stage in embedding space. The top-k slate row is where agents spend most of their time.
| | No diversity | Implicit | Explicit |
|---|---|---|---|
| Top-k slate | k-NN retrieval | MMR | k-center / farthest-first |
| Single best | 1-NN | Medoid | Farthest-point sampling |
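The Top-k slate / Implicit cell (MMR) fits in a few lines. A sketch, with `relevance` as a precomputed score map and `similarity` as any symmetric function; λ = 1 recovers plain top-k:

```python
def mmr(candidates, relevance, similarity, k, lam=0.7):
    """Maximal Marginal Relevance: greedily pick items that are
    relevant to the query but not redundant with picks so far."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(c):
            # Redundancy = similarity to the closest already-selected item.
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

The Explicit column (k-center / farthest-first) drops the relevance term entirely and maximizes the minimum pairwise distance instead.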
Example: article feed
A concrete embedding pipe for surfacing fresh articles from an RSS-like feed:
- Perceive: embed each new article with a sentence transformer.
- Cache: add to an HNSW index.
- Filter: for each candidate, compute distance to nearest existing article in the corpus. Reject if below a novelty threshold (density-based filtering).
- Attend: from survivors, pick top-k by relevance × diversity using MMR. The similarity penalty is cosine distance in the embedding space.
- Consolidate: track which articles the user reads. Fine-tune the embedding or adjust the novelty threshold.
- Remember: serialize the HNSW index and read-history to disk.
The Filter step inverts the usual ANN query: instead of finding items close to a query, it rejects items close to existing coverage. The Attend step is standard MMR. The Consolidate step closes the loop.
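The inverted query is easier to see in code than in prose. A sketch of the Filter step, using brute-force nearest-distance in place of an ANN index and Euclidean distance for simplicity:

```python
import math

def novelty_filter(candidates, corpus, threshold):
    """Density-based novelty filter: a candidate survives only if its
    nearest existing article is far enough away, i.e. it covers ground
    the corpus does not already cover."""
    survivors = []
    for cand in candidates:
        nearest = min((math.dist(cand, doc) for doc in corpus),
                      default=float("inf"))
        if nearest >= threshold:  # reject if too close to existing coverage
            survivors.append(cand)
    return survivors
```

Note the direction: a standard ANN query keeps the closest items; this keeps the farthest. The same index serves both, only the acceptance test flips.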
How embedding space differs from flat
Flat pipelines process records by attribute. Embedding pipelines process records by position in a learned space. The dominant primitives shift:
- Cache: key-based lookup → geometric index (HNSW, IVF)
- Filter: predicate scan → proximity retrieval (ANN)
- Attend: score-based ranking → coverage-aware diversification (MMR, k-center)
- Consolidate: parameter update → space reshaping (metric learning)
Embedding pipelines also use predicate filtering (metadata filters alongside ANN), but the geometric operations are the ones that distinguish the pipe.
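The coexistence of the two filter styles can be sketched in one function: the flat pipe's predicate scan survives as a metadata pre-filter, while the ranking itself is geometric. The item schema here is illustrative:

```python
import math

def filtered_ann(items, query, k, predicate):
    """Predicate filtering + proximity retrieval in one pass.
    `items` are dicts with an embedding under "vec" and attributes
    under "meta"; `predicate` is an attribute test, as in a flat pipe."""
    eligible = [it for it in items if predicate(it["meta"])]
    eligible.sort(key=lambda it: math.dist(query, it["vec"]))  # geometric rank
    return eligible[:k]
```

Production vector stores push the predicate into the index traversal rather than pre-filtering, but the composed contract is the same.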
Written via the double loop.