Show HN: A benchmark for the failure modes of agent memory

Kausha3/agent-memory-bench: An open benchmark for the failure modes of agent memory systems: retraction, collision, recall, conflict. Offline, zero-dependency, reproducible.

agent-memory-bench

An open benchmark for the failure modes of agent memory systems.

Everyone shipping an AI agent bolts on a "memory," and everyone evaluates it the same shallow way: did retrieval fetch a relevant chunk? But agents don't fail in the field because retrieval missed. They fail because the fact they retrieved was stale, belonged to the wrong entity, was buried under noise, or contradicted another fact the system also believed. Those are the bugs that make an agent confidently wrong.

agent-memory-bench scores those four failure modes directly. It runs offline, with zero dependencies and no API key, so anyone can reproduce the leaderboard in one command.

    npm install
    npm run bench   # prints the leaderboard below
    npm test        # adversarial tests for the scoring core + baselines

Leaderboard

Reference baselines across 13 scenarios in 4 categories. Numbers are produced by npm run bench — reproduce them yourself.

| system | retraction | collision | recall | conflict | overall |
|---|---|---|---|---|---|
| typed-constraint | 100% | 100% | 75% | 100% | 92% |
| keyword | 0% | 100% | 75% | 0% | 46% |
| recency | 100% | 0% | 0% | 0% | 23% |

Read this as a map of where each strategy breaks, not a ranking of products:

keyword (similarity retrieval, no model of time) aces collision but scores 0% on retraction and conflict — with no notion of time it happily returns the value the user already changed.

recency (latest token-match wins) fixes retraction but collapses on collision and recall — it drifts to the most recent look-alike, which is usually the wrong entity.

typed-constraint models time (facts retract) and identity (facts bind to an entity), so it survives three categories. It still misses the one multi-hop recall scenario, a deliberate frontier item no baseline solves, so the benchmark isn't saturated.

The headline isn't "92%." It's that retrieval-quality metrics would rate all three systems similarly, while their answer correctness ranges from 23% to 92%. That gap is the point.
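To make the keyword and recency behaviours described above concrete, here is a toy sketch. These functions are hypothetical stand-ins written for this illustration, not the repository's baselines; the names `keywordPick` and `recencyPick`, the tokenizer, and the tie-breaking rule are all assumptions.

```typescript
// Toy stand-ins for illustration only; not the repo's actual baselines.
function tokens(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Number of question tokens that also appear in the note.
function overlap(question: string, note: string): number {
  const noteTokens = new Set(tokens(note));
  return tokens(question).filter((t) => noteTokens.has(t)).length;
}

// "keyword"-style: highest overlap wins; ties stay with the earliest note,
// because nothing here models time.
function keywordPick(notes: string[], question: string): string {
  let best = "";
  let bestScore = 0;
  for (const note of notes) {
    const score = overlap(question, note);
    if (score > bestScore) {
      best = note;
      bestScore = score;
    }
  }
  return best;
}

// "recency"-style: the latest note with any overlap wins.
function recencyPick(notes: string[], question: string): string {
  let best = "";
  for (const note of notes) {
    if (overlap(question, note) > 0) best = note;
  }
  return best;
}

const question = "Where is Alice's office?";

// Retraction: the second note updates the first.
const retraction = ["Alice's office is in Paris", "Alice's office is now in Berlin"];

// Collision: two look-alike entities.
const collision = ["Alice's office is in Berlin", "Alicia's office is in Madrid"];

console.log(keywordPick(retraction, question)); // stale Paris note
console.log(recencyPick(retraction, question)); // updated Berlin note
console.log(keywordPick(collision, question)); // Alice's note
console.log(recencyPick(collision, question)); // Alicia's note (wrong entity)
```

In this toy, the overlap-based picker keeps the stale value on the retraction case, while the latest-match picker drifts to the look-alike entity on the collision case, mirroring the pattern in the table.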

The four failure modes

| Category | One-line definition |
|---|---|
| Retraction | A fact is updated; the new value must win and the old must not surface. |
| Collision | Two similar entities; answer about the one asked, don't conflate. |
| Recall | Fact stated early, needed late, with noise (incl. a multi-hop frontier case). |
| Conflict | A fact is explicitly contradicted in-text; resolve to one current value. |

Full definitions, worked examples, and why each one is hard are in TAXONOMY.md.

Add your system

A system implements one small interface (src/types.ts):

    interface MemorySystem {
      readonly name: string;
      reset(): void | Promise<void>; // called before each scenario
      remember(text: string): void | Promise<void>;
      query(question: string): string | Promise<string>;
    }

Methods may be async, so an embedding store, a hosted memory product, or an LLM-backed extractor plugs in exactly like the pure-code baselines. Drop your class into src/systems/, add it to the list in src/run.ts, and run npm run bench. Use npm run bench -- --fails to see every query your system missed and what it answered.
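As a rough illustration of what plugging in a system might look like, here is a minimal sketch of a class that satisfies the interface. The class name `LastMatchMemory`, the helper `tokenize`, and the matching rule are hypothetical and exist only for this example; the interface is repeated so the snippet stands alone rather than importing from src/types.ts.

```typescript
// Interface as described above, repeated so this sketch is self-contained.
interface MemorySystem {
  readonly name: string;
  reset(): void | Promise<void>;
  remember(text: string): void | Promise<void>;
  query(question: string): string | Promise<string>;
}

// Hypothetical helper: lowercase alphanumeric tokens.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Hypothetical example system: answers with the most recently remembered
// text that shares at least one token with the question.
class LastMatchMemory implements MemorySystem {
  readonly name = "last-match-example";
  private notes: string[] = [];

  // Clear all state so scenarios cannot leak into each other.
  reset(): void {
    this.notes = [];
  }

  remember(text: string): void {
    this.notes.push(text);
  }

  // Synchronous here; the interface also allows returning a Promise<string>.
  query(question: string): string {
    const wanted = tokenize(question);
    for (let i = this.notes.length - 1; i >= 0; i--) {
      const note = this.notes[i];
      if (note === undefined) continue;
      const noteTokens = tokenize(note);
      if (wanted.some((w) => noteTokens.includes(w))) return note;
    }
    return ""; // nothing remembered matches
  }
}
```

A system backed by a network call would presumably return promises from the same three methods instead; the shape of the class stays the same.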

How it works

Scenarios (src/scenarios/) are ordered scripts of remember and query events. Each query declares the substring the answer must contain and the stale substrings it must not — so leaking an out-of-date...
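The substring check described above could be sketched roughly as follows. The type `QueryExpectation`, its field names, the function name `passes`, and the case-insensitive comparison are assumptions made for this illustration; the repository's actual scoring code may differ.

```typescript
// Hypothetical shape of what a query declares; field names are assumptions.
interface QueryExpectation {
  mustContain: string; // substring the answer has to include
  mustNotContain: string[]; // stale substrings that must not appear
}

// Passes only if the required substring is present and no stale one leaks.
// Case-insensitive matching is an assumption of this sketch.
function passes(answer: string, expected: QueryExpectation): boolean {
  const text = answer.toLowerCase();
  const hasRequired = text.includes(expected.mustContain.toLowerCase());
  const leaksStale = expected.mustNotContain.some((stale) =>
    text.includes(stale.toLowerCase())
  );
  return hasRequired && !leaksStale;
}

const expectation: QueryExpectation = {
  mustContain: "Berlin",
  mustNotContain: ["Paris"],
};

console.log(passes("Your office is in Berlin.", expectation)); // true
console.log(passes("You moved from Paris to Berlin.", expectation)); // false
```

Under this rule an answer that mentions both the current and the stale value fails, which is what separates answer correctness from a plain retrieval hit.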
