How AI Memory Systems Break at Scale | Tenure
Architecture: How AI memory systems break at scale
The failure modes are structural, not incidental. Similarity search accumulates noise faster than<br>any model can filter it. Here is exactly what breaks, and how we designed around each failure.
Tenure research · ~12 min read
TL;DR<br>At small scale, frontier models can filter retrieval noise. At thousands of beliefs, that safety net disappears entirely.<br>Vector similarity cannot discriminate between beliefs that share a domain but differ in relevance. This is a geometry problem, not a capability problem.<br>Multi-turn sessions compound the failure: beliefs from off-topic turns contaminate re-entry queries with drift scores of 0.92 to 1.0.<br>Ingestion latency creates a structural availability gap: beliefs introduced mid-session may not be queryable until the session has ended.<br>The fix is not a better embedding model. Precision across a 20x range in model scale stays at 0.09. The fix is a different retrieval signal.
The hidden assumption: Memory systems are tested at the wrong scale
Every memory system for LLM agents looks adequate in demos and early sessions. The corpus is small,<br>the frontier model is capable, and the model compensates for imprecise retrieval by reasoning through noise.<br>This works until it does not.
The field has converged on benchmarks that operate at tens to low hundreds of beliefs.<br>At that scale, a system that returns its entire store achieves recall of 1.0 and scores competitively<br>on answer-quality metrics, because a capable model can locate the correct answer in a noisy context window.<br>The precision problem is invisible at the scale where everything is tested,<br>and fully visible at the scale where everything breaks.
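The recall-versus-precision asymmetry described above can be made concrete with a small sketch. It assumes a hypothetical retriever that always returns the whole store and a query with exactly one relevant belief; the function names and store sizes are illustrative and are not taken from any benchmark.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for a single query."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def full_store_metrics(store_size: int, relevant_count: int = 1) -> tuple[float, float]:
    """Metrics for a hypothetical retriever that returns every stored belief."""
    store = {f"belief-{i}" for i in range(store_size)}
    relevant = {f"belief-{i}" for i in range(relevant_count)}
    return precision_recall(store, relevant)


if __name__ == "__main__":
    # Illustrative store sizes: benchmark scale versus persistent-use scale.
    for size in (50, 5000):
        precision, recall = full_store_metrics(size)
        print(f"store={size:>5}  precision={precision:.4f}  recall={recall:.1f}")
```

Under these assumptions recall stays at 1.0 at both sizes, while precision falls from 0.02 to 0.0002. A metric that rewards only recall or answer quality cannot see that drop.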
Serious persistent memory use reaches thousands of beliefs. Full-corpus retrieval becomes architecturally<br>impossible. The precision problem can no longer be offloaded to inference, and the failure that was<br>invisible in evaluation surfaces immediately in production.
The generative model was never a neutral downstream consumer.<br>It was load-bearing infrastructure compensating for retrieval imprecision.<br>That load-bearing role cannot scale with the store.
Failure mode 1: Cosine similarity cannot discriminate within a domain
In any belief store where the user works within a technical domain, all beliefs about that domain<br>occupy a shared semantic region. A query about Redis is semantically close to the Redis belief you want,<br>and equally close to beliefs about MongoDB, TypeScript, Kubernetes, Fastify, and GitHub Actions.<br>Cosine scores across these beliefs range from 0.65 to 0.83. The semantic relatedness is genuine, but it measures<br>shared domain, not relevance to the query.
The predictable response is to reach for a more capable embedding model. We tested three,<br>spanning a 20x range in scale: a 768-dimension model, a 1024-dimension model, and an 8-billion-parameter<br>model producing 4096-dimension embeddings. Mean retrieval precision was 0.09 across all three.<br>The qwen3-8b result is the clearest demonstration that this is not a capability problem.<br>At over 1,100ms mean per query, it produced the same precision as the smallest model.
| Embedding model | Dimensions | Mean precision | Active retrieval passes | Mean latency |
| --- | --- | --- | --- | --- |
| nomic-embed-text | 768 | 0.09 | 0 / 48 | 43ms |
| mxbai-embed-large | 1024 | 0.09 | 0 / 48 | 96ms |
| qwen3-8b | 4096 | 0.09 | 0 / 48 | 1,131ms |

Precision is invariant to embedding model scale. All 11 total passes in every configuration are structural<br>or trivially empty cases. Zero active retrieval passes across all three models.
A more powerful embedder distributes scores differently across the corpus but cannot eliminate genuine<br>semantic proximity within a domain-specific corpus. The fix is not a better ruler.<br>It is a different measurement instrument entirely.
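The geometry argument can be sketched with toy vectors. Everything below is an assumption made for illustration: the five-dimensional embeddings, the belief names, and the 0.65 threshold are invented, with axis 0 standing in for a shared "backend infrastructure" component. None of it comes from PrecisionMemBench or from a real embedding model.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical embeddings. Axis 0: shared domain component.
# Axes 1-3: topic-specific components. Axis 4: an unrelated domain.
BELIEFS = {
    "redis-session-store": [1.0, 0.5, 0.1, 0.0, 0.0],
    "mongodb-schema": [1.0, 0.0, 0.6, 0.0, 0.0],
    "kubernetes-deploy": [1.0, 0.0, 0.0, 0.6, 0.0],
    "sourdough-recipe": [0.0, 0.0, 0.0, 0.0, 1.0],
}
QUERY = [1.0, 0.6, 0.0, 0.0, 0.0]  # a hypothetical query about Redis
RELEVANT = {"redis-session-store"}


def retrieve(query: list[float], beliefs: dict[str, list[float]], threshold: float) -> dict[str, float]:
    """Return every belief whose cosine score meets the threshold."""
    scores = {name: cosine(query, vec) for name, vec in beliefs.items()}
    return {name: score for name, score in scores.items() if score >= threshold}


hits = retrieve(QUERY, BELIEFS, threshold=0.65)
precision = len(RELEVANT & set(hits)) / len(hits)

if __name__ == "__main__":
    for name, score in sorted(hits.items(), key=lambda item: -item[1]):
        print(f"{name:<22} {score:.2f}")
    print(f"precision = {precision:.2f}")
```

In this toy setup the shared domain axis alone lifts the MongoDB and Kubernetes beliefs above the threshold, so only one of the three returned beliefs is relevant. The sketch is more forgiving than the measured results, since the target belief still ranks first here. It shows only why a similarity cutoff admits every in-domain belief, whatever embedder produced the vectors.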
Failure mode 2: Extraction quality does not predict retrieval precision
One of the more counterintuitive findings from our evaluation is that faithfully extracted beliefs<br>can still fail at retrieval. The extraction pipeline and the retrieval pipeline are architecturally<br>decoupled, and precision failures occur in the retrieval layer regardless of what the extraction layer did.
Consider a concrete case from PrecisionMemBench. A relation-type belief linking an auth service<br>to a Redis dependency was ingested through Mem0's extraction pipeline. The stored memory preserved<br>every operationally significant fact: the service name, the dependency target, the fail-open behavior,<br>and the coupling assertion. High-quality extraction by any measure.
Stored in Mem0 after extraction<br>User's auth service depends on Redis for session storage.<br>If Redis goes down, auth fails open by denying all requests.<br>Auth resilience discussions must address...