We reduced RAG retrieval cost 10× with a hippocampus-inspired memory substrate

We Built a Memory Engine. The Brain Told Us How. - BricbyBric

The kind of reading that starts at 2 a.m. and ends with you questioning why every retrieval system in AI works nothing like this.

Almost every RAG pipeline in production today follows the same pattern: dense vectors, nearest-neighbor search, pull in as much context as possible, and hope the LLM figures out what is relevant. It works, but it is expensive by design and it has little to do with how the brain actually retrieves a memory. So we built something that does.

Sparse Codes

40 Active Bits. 8,192 Slots.

In 1971, John O'Keefe discovered that certain neurons in the rat hippocampus fire only when the animal is in a specific location. Not when it sees something or hears something. Only in a particular place. What this eventually revealed is how memory gets encoded: as a sparse pattern, a small number of neurons active at once out of a vast pool of silence.

The math is elegant. If you have 8,192 neurons and only 40 fire at once, two random patterns share about 0.2 bits on average by chance (40 × 40 / 8,192 ≈ 0.195). Any non-trivial overlap is signal, not noise. The sparsity is the point.

Our engine, Hippocampus, stores facts the same way. Each fact becomes a binary vector with 40 active bits out of 8,192. Retrieval works through lexical seeding into a typed relational graph. There is no embedding model at query time, which means there is no embedding cost at query time. The efficiency comes from the architecture, not from tuning. That distinction matters when you are thinking about whether the numbers generalize.
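The chance-overlap figure above can be checked with a few lines of code. For two independent random codes with k active bits out of n slots, the expected number of shared bits is k² / n. The sketch below computes that value and estimates it empirically. The function names and the use of uniformly random codes are assumptions made for illustration; this is not the project's encoder.

```python
import random

N_SLOTS = 8192   # total slots, as stated in the article
N_ACTIVE = 40    # active bits per code, as stated in the article


def random_code(rng, n_slots=N_SLOTS, n_active=N_ACTIVE):
    """Return a random sparse code as a frozenset of active slot indices."""
    return frozenset(rng.sample(range(n_slots), n_active))


def overlap(a, b):
    """Number of active bits that two codes share."""
    return len(a & b)


def expected_chance_overlap(n_slots=N_SLOTS, n_active=N_ACTIVE):
    """Expected shared bits between two independent random codes: k^2 / n."""
    return n_active * n_active / n_slots


def mean_overlap(trials=20000, seed=0):
    """Empirical mean overlap over many pairs of random codes."""
    rng = random.Random(seed)
    total = sum(overlap(random_code(rng), random_code(rng)) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    print("analytic:", expected_chance_overlap())
    print("empirical:", mean_overlap())
```

Because the chance baseline sits well below one shared bit, even an overlap of a handful of bits between a query code and a stored code is very unlikely to be coincidence.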

Results

What We Found

System              CF Accuracy    Tokens / Answer
Hippocampus         90.91%         ~12
MiniLM-filtered     77.27%         ~121
BM25                31.82%         ~495

CF stands for contradiction-free: correct answer, no contradicting claims introduced. We use this instead of top-1 accuracy because in a production agent a confident wrong answer causes more damage than silence.

On non-list-tail facts, which make up the majority of any real retrieval workload, Hippocampus reaches 94.74% CF versus 89.47% for MiniLM-filtered. That is better accuracy at roughly one tenth the token cost (~12 versus ~121 tokens per answer).

On list-tail facts, MiniLM-filtered scores 0% because its filtering step discards the relevant list context entirely. Hippocampus scores 66.67%.

Every number here has a JSONL file behind it, pinned to a commit hash, reproducible from one command in the public repo.
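To make the CF metric concrete, here is a minimal sketch of how contradiction-free accuracy and tokens per answer could be computed from per-fact records. The `Answer` fields and the toy records are hypothetical and exist only to illustrate the definition; they are not the project's JSONL schema or its data.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    fact_id: str          # hypothetical identifier for the evaluated fact
    correct: bool         # the answer matches the gold fact
    contradictions: int   # claims in the answer that contradict the gold fact
    tokens: int           # context tokens spent producing this answer


def is_contradiction_free(a: Answer) -> bool:
    """CF requires a correct answer AND zero contradicting claims."""
    return a.correct and a.contradictions == 0


def cf_accuracy(answers) -> float:
    """Fraction of answers that are contradiction-free."""
    if not answers:
        return 0.0
    return sum(is_contradiction_free(a) for a in answers) / len(answers)


def mean_tokens(answers) -> float:
    """Average context tokens per answer."""
    if not answers:
        return 0.0
    return sum(a.tokens for a in answers) / len(answers)


# Toy records, invented for illustration only.
toy = [
    Answer("f1", correct=True, contradictions=0, tokens=10),
    Answer("f2", correct=True, contradictions=1, tokens=14),   # correct but contradicted
    Answer("f3", correct=False, contradictions=0, tokens=12),  # wrong
    Answer("f4", correct=True, contradictions=0, tokens=12),
]

if __name__ == "__main__":
    print(f"CF accuracy: {cf_accuracy(toy):.2%}, tokens/answer: {mean_tokens(toy):.1f}")
```

Note that the second toy record would count as a hit under plain top-1 accuracy but is a miss under CF, which is the distinction the metric is meant to capture.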

Methodology

We Ran It Like a Drug Trial

In clinical trials you write down what would falsify your hypothesis before running the experiment. Then you run it and publish the result regardless, pass or fail.

We do the same for every experiment in this project. Before any run we commit to exact acceptance bars: which specific facts must flip, what regression threshold counts as failure, and what ablation test confirms the mechanism is actually doing the work.

Across dozens of acceptance checks so far, the failure rate has held around a third. The failures are dated and root-caused in the project guide. Some are mechanisms we had named version numbers after before discovering they were doing nothing on aggregate metrics. They stayed on the record.

This matters not as a philosophy but as a technical claim. When a bar passes, the commitment existed before the data.
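One way to picture a pre-committed acceptance bar is as a small, hashable record that is written before the run and checked afterwards. The sketch below is an assumption about how such a check could look, not the project's tooling: the `AcceptanceBar` fields, `commitment_hash`, and `evaluate` are hypothetical names, and the inputs are plain dictionaries mapping a fact id to whether it was answered correctly.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class AcceptanceBar:
    must_flip: tuple             # fact ids that must go from wrong to correct
    max_regressions: int         # allowed correct-to-wrong flips elsewhere
    ablation_must_revert: bool   # disabling the mechanism must undo the flips


def commitment_hash(bar: AcceptanceBar) -> str:
    """Stable digest of the bar, recorded before the run so it cannot drift."""
    payload = json.dumps(asdict(bar), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def evaluate(bar, before, after, ablated):
    """Check a run against the bar.

    before / after / ablated: dict of fact id -> bool (answered correctly)
    for the baseline, the run with the mechanism, and the run with it disabled.
    """
    flips_ok = all(not before[f] and after[f] for f in bar.must_flip)
    regressions = [f for f in before if before[f] and not after[f]]
    reverted = all(not ablated[f] for f in bar.must_flip)
    return {
        "flips_ok": flips_ok,
        "regressions_ok": len(regressions) <= bar.max_regressions,
        "ablation_ok": reverted or not bar.ablation_must_revert,
        "regressions": regressions,
    }
```

Committing the digest from `commitment_hash` alongside the experiment plan gives a simple, checkable record that the bar existed before any results did.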

Regression

The Experiment That Produced a Regression We Did Not Expect

Earlier this week we shipped a query-expansion fix. The problem was that our retrieval is lexical, and natural English queries do not match template-canonical cell labels: "Where was X born" does not find a cell named "birth_place." We wrote the falsifier first: two specific birth-place facts must flip from wrong to correct, no more than one regression elsewhere, and disabling the fix must drop them back. Then we ran it.

Two things happened that were not in the plan. A third fact we had not targeted also flipped, one we had written off as out of scope. We logged it as a bonus finding with its own root cause. We could not claim it as a win since we had not pre-committed to it.

Then an unrelated fact broke. Appending a property token to a query that already matched the correct cell shifted the ranking, and the wrong cell won. The net effect was still positive, but the mechanism was not regression-free, so it could not go into the core SDK until we understood why.

We traced the regression into the layer that resolves which version of a fact is current, found it, and made the surgical fix. Five bars pass on the follow-up run, including byte-level determinism: same output across ten identical runs, standard deviation exactly 0.0000.

Three more cycles followed that week. A past-tense verb regex closed one more fact on the strict per-fact no-regression bar. Promoting the ranking fix closed the original regression without touching the other 43 facts. During artifact prep we found a dataset error where one row had been anchored to an election-night Wikipedia revision that said "President-elect" rather than "President," making every system fail it for the wrong reason. We corrected it, documented it with a...
