Kausha3/agent-memory-bench: An open benchmark for the failure modes of agent memory systems: retraction, collision, recall, conflict. Offline, zero-dependency, reproducible.
agent-memory-bench
An open benchmark for the failure modes of agent memory systems.
Everyone shipping an AI agent bolts on a "memory," and everyone evaluates it the same shallow way: did retrieval fetch a relevant chunk? But agents don't fail in the field because retrieval missed. They fail because the fact they retrieved was stale, belonged to the wrong entity, was buried under noise, or contradicted another fact the system also believed. Those are the bugs that make an agent confidently wrong.
agent-memory-bench scores those four failure modes directly, and it runs offline, with zero dependencies and no API key, so the leaderboard is reproducible by anyone in one command.

    npm install
    npm run bench   # prints the leaderboard below
    npm test        # adversarial tests for the scoring core + baselines

Leaderboard
Reference baselines across 13 scenarios in 4 categories. Numbers are produced by npm run bench; reproduce them yourself.

| system | retraction | collision | recall | conflict | overall |
|---|---|---|---|---|---|
| typed-constraint | 100% | 100% | 75% | 100% | 92% |
| keyword | 0% | 100% | 75% | 0% | 46% |
| recency | 100% | 0% | 0% | 0% | 23% |

Read this as a map of where each strategy breaks, not a ranking of products:
- keyword (similarity retrieval, no model of time) aces collision but scores 0% on retraction and conflict: with no notion of time it happily returns the value the user already changed.
- recency (latest token-match wins) fixes retraction but collapses on collision and recall: it drifts to the most recent look-alike, which is usually the wrong entity.
- typed-constraint models time (facts retract) and identity (facts bind to an entity), so it survives three categories. It still misses the one multi-hop recall scenario, a deliberate frontier item no baseline solves, so the benchmark isn't saturated.
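
To make the first two failure patterns concrete, here is a minimal sketch of two toy retrievers. These are hypothetical stand-ins written for illustration, not the repository's baselines: `keywordAnswer` picks the remembered sentence with the highest word overlap (earliest wins ties), and `recencyAnswer` picks the latest sentence that shares any word with the question. The function names, tokenizer, and sample sentences are all assumptions.

```typescript
// Toy illustration only; NOT the code in src/systems/.

// Lowercase, split on non-alphanumerics, drop empties and duplicates.
function tokens(text: string): string[] {
  const words = text.toLowerCase().split(/[^a-z0-9]+/).filter(w => w.length > 0);
  return words.filter((w, i) => words.indexOf(w) === i);
}

// Number of distinct words the two token lists have in common.
function overlap(a: string[], b: string[]): number {
  return a.filter(t => b.indexOf(t) !== -1).length;
}

// Similarity-only retrieval: highest overlap wins, earliest sentence wins ties.
function keywordAnswer(log: string[], question: string): string {
  const q = tokens(question);
  let best = "";
  let bestScore = 0;
  for (const fact of log) {
    const score = overlap(q, tokens(fact));
    if (score > bestScore) {
      best = fact;
      bestScore = score;
    }
  }
  return best;
}

// Recency-only retrieval: the latest sentence with any shared word wins.
function recencyAnswer(log: string[], question: string): string {
  const q = tokens(question);
  for (let i = log.length - 1; i >= 0; i--) {
    if (overlap(q, tokens(log[i])) > 0) return log[i];
  }
  return "";
}

// Retraction: the second sentence supersedes the first.
const retractionLog = [
  "My favorite editor is Vim.",
  "Update: my favorite editor is now Emacs.",
];
// Collision: two look-alike entities.
const collisionLog = ["Alice's manager is Dana.", "Alicia's manager is Raj."];

console.log(keywordAnswer(retractionLog, "What is my favorite editor?")); // stale Vim sentence
console.log(recencyAnswer(retractionLog, "What is my favorite editor?")); // current Emacs sentence
console.log(keywordAnswer(collisionLog, "Who is Alice's manager?"));      // Alice's sentence
console.log(recencyAnswer(collisionLog, "Who is Alice's manager?"));      // Alicia's sentence (wrong entity)
```

In this toy, the keyword retriever ties on the retraction pair and returns the stale sentence, while the recency retriever answers the collision question with the newer look-alike. Neither has a notion of both time and identity, which is the gap the list above describes.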
The headline isn't "92%." It's that retrieval-quality metrics would rate all three systems similarly, while their answer correctness ranges from 23% to 92%. That gap is the point.
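
The overall column in the leaderboard appears to be the pooled fraction of scenarios passed rather than the mean of the four category percentages. The sketch below checks that reading. The per-category scenario counts (3, 3, 4, 3) are an assumption inferred from the published percentages and the stated total of 13; the README does not list them, and `overallPercent` is a hypothetical helper, not part of the benchmark.

```typescript
// Assumed scenario counts per category (inferred, sums to the stated 13).
const scenarioCounts = { retraction: 3, collision: 3, recall: 4, conflict: 3 };
type Category = keyof typeof scenarioCounts;

// Pool passes across all scenarios, then round to a whole percent.
function overallPercent(passRate: Record<Category, number>): number {
  let passed = 0;
  let total = 0;
  for (const c of Object.keys(scenarioCounts) as Category[]) {
    passed += passRate[c] * scenarioCounts[c];
    total += scenarioCounts[c];
  }
  return Math.round((passed / total) * 100);
}

// Category pass rates copied from the leaderboard.
const typedConstraint = { retraction: 1, collision: 1, recall: 0.75, conflict: 1 };
const keyword = { retraction: 0, collision: 1, recall: 0.75, conflict: 0 };
const recency = { retraction: 1, collision: 0, recall: 0, conflict: 0 };

console.log(overallPercent(typedConstraint)); // 92  (12 of 13)
console.log(overallPercent(keyword));         // 46  (6 of 13)
console.log(overallPercent(recency));         // 23  (3 of 13)
```

Under these assumed counts the pooled figures match the table, whereas a plain average of the four category columns would give roughly 94%, 44%, and 25%.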
The four failure modes
| Category | One-line definition |
|---|---|
| Retraction | A fact is updated; the new value must win and the old must not surface. |
| Collision | Two similar entities; answer about the one asked, don't conflate. |
| Recall | Fact stated early, needed late, with noise (incl. a multi-hop frontier case). |
| Conflict | A fact is explicitly contradicted in-text; resolve to one current value. |

Full definitions, worked examples, and why each one is hard are in TAXONOMY.md.
Add your system
A system implements one small interface (src/types.ts):

    interface MemorySystem {
      readonly name: string;
      reset(): void | Promise<void>; // called before each scenario
      remember(text: string): void | Promise<void>;
      query(question: string): string | Promise<string>;
    }

Methods may be async, so an embedding store, a hosted memory product, or an LLM-backed extractor plugs in exactly like the pure-code baselines. Drop your class into src/systems/, add it to the list in src/run.ts, and run npm run bench. Use npm run bench -- --fails to see every query your system missed and what it answered.
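
As a rough sketch of what such a class could look like, the example below implements the interface with a deliberately naive strategy. `LastMatchMemory` and the `ask` helper are hypothetical names invented here; how src/run.ts actually registers and drives systems is not shown in this README, so treat the driver as an assumption.

```typescript
// Interface as listed in src/types.ts, repeated so the sketch is self-contained.
interface MemorySystem {
  readonly name: string;
  reset(): void | Promise<void>;
  remember(text: string): void | Promise<void>;
  query(question: string): string | Promise<string>;
}

// Hypothetical example system: stores every sentence and answers with the
// most recent one containing any question word longer than three characters
// (substring match, short words ignored).
class LastMatchMemory implements MemorySystem {
  readonly name = "last-match-example";
  private facts: string[] = [];

  reset(): void {
    this.facts = [];
  }

  remember(text: string): void {
    this.facts.push(text);
  }

  query(question: string): string {
    const words = question
      .toLowerCase()
      .split(/[^a-z0-9]+/)
      .filter(w => w.length > 3);
    for (let i = this.facts.length - 1; i >= 0; i--) {
      const fact = this.facts[i].toLowerCase();
      if (words.some(w => fact.indexOf(w) !== -1)) return this.facts[i];
    }
    return ""; // nothing matched
  }
}

// Hypothetical driver: awaiting every call handles sync and async systems alike.
async function ask(system: MemorySystem, facts: string[], question: string): Promise<string> {
  await system.reset();
  for (const f of facts) await system.remember(f);
  return await system.query(question);
}
```

Because `await` accepts plain values as well as promises, the same driver works whether a system's methods return immediately or call out to a remote store.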
How it works
Scenarios (src/scenarios/) are ordered scripts of remember and query events. Each query declares the substring the answer must contain and the stale substrings it must not, so leaking an out-of-date...