Show HN: A benchmark for the failure modes of agent memory

Pankhi1231 pts0 comments

GitHub - Kausha3/agent-memory-bench: An open benchmark for the failure modes of agent memory systems: retraction, collision, recall, conflict. Offline, zero-dependency, reproducible. · GitHub


agent-memory-bench

An open benchmark for the failure modes of agent memory systems.

Everyone shipping an AI agent bolts on a "memory," and everyone evaluates it the same shallow way: did retrieval fetch a relevant chunk? But agents don't fail in the field because retrieval missed. They fail because the fact they retrieved was stale, belonged to the wrong entity, was buried under noise, or contradicted another fact the system also believed. Those are the bugs that make an agent confidently wrong.

agent-memory-bench scores those four failure modes directly, and it runs offline, with zero dependencies and no API key, so the leaderboard is reproducible by anyone in one command.

    npm install
    npm run bench   # prints the leaderboard below
    npm test        # adversarial tests for the scoring core + baselines

Leaderboard

Reference baselines across 13 scenarios in 4 categories. Numbers are produced by npm run bench, so you can reproduce them yourself.

| system | retraction | collision | recall | conflict | overall |
| --- | --- | --- | --- | --- | --- |
| typed-constraint | 100% | 100% | 75% | 100% | 92% |
| keyword | 0% | 100% | 75% | 0% | 46% |
| recency | 100% | 0% | 0% | 0% | 23% |

Read this as a map of where each strategy breaks, not a ranking of products:

keyword (similarity retrieval, no model of time) aces collision but scores 0% on retraction and conflict: with no notion of time it happily returns the value the user already changed.

recency (latest token-match wins) fixes retraction but collapses on collision and recall: it drifts to the most recent look-alike, which is usually the wrong entity.

typed-constraint models time (facts retract) and identity (facts bind to an entity), so it survives three categories. It still misses the one multi-hop recall scenario, a deliberate frontier item that no baseline solves, so the benchmark isn't saturated.
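To make "models time and identity" concrete, here is a minimal sketch of a store that keys each fact by entity and attribute and lets the latest write replace the earlier one. It is an illustration only, not the repository's typed-constraint baseline; `KeyedFactStore`, `Fact`, `put`, and `lookup` are hypothetical names, and the sketch assumes facts have already been parsed into structured form.

```typescript
// Hypothetical sketch, not code from agent-memory-bench.
// Assumes facts arrive already parsed into (entity, attribute, value).
type Fact = { entity: string; attribute: string; value: string };

class KeyedFactStore {
  private facts = new Map<string, string>();

  private key(entity: string, attribute: string): string {
    return `${entity.toLowerCase()}::${attribute.toLowerCase()}`;
  }

  // Writing the same (entity, attribute) pair again replaces the old value,
  // so a retracted value is no longer available to return.
  put(fact: Fact): void {
    this.facts.set(this.key(fact.entity, fact.attribute), fact.value);
  }

  // Lookup matches the entity exactly, so a similar-looking entity
  // cannot be returned in its place.
  lookup(entity: string, attribute: string): string | undefined {
    return this.facts.get(this.key(entity, attribute));
  }
}

const store = new KeyedFactStore();
store.put({ entity: "Alice", attribute: "city", value: "Boston" });
store.put({ entity: "Alicia", attribute: "city", value: "Denver" });
store.put({ entity: "Alice", attribute: "city", value: "Austin" }); // update

console.log(store.lookup("Alice", "city"));  // Austin
console.log(store.lookup("Alicia", "city")); // Denver
```

The hard part in practice is the step this sketch assumes away: turning free text into the structured facts that get stored.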

The headline isn't "92%." It's that retrieval-quality metrics would rate all three systems similarly, while their answer correctness ranges from 23% to 92%. That gap is the point.

The four failure modes

| Category | One-line definition |
| --- | --- |
| Retraction | A fact is updated; the new value must win and the old must not surface. |
| Collision | Two similar entities; answer about the one asked, don't conflate. |
| Recall | Fact stated early, needed late, with noise (incl. a multi-hop frontier case). |
| Conflict | A fact is explicitly contradicted in-text; resolve to one current value. |

Full definitions, worked examples, and why each one is hard are in TAXONOMY.md.

Add your system

A system implements one small interface (src/types.ts):

    interface MemorySystem {
      readonly name: string;
      reset(): void | Promise<void>; // called before each scenario
      remember(text: string): void | Promise<void>;
      query(question: string): string | Promise<string>;
    }

Methods may be async, so an embedding store, a hosted memory product, or an LLM-backed extractor plugs in exactly like the pure-code baselines. Drop your class into src/systems/, add it to the list in src/run.ts, and run npm run bench. Use npm run bench -- --fails to see every query your system missed and what it answered.
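As a rough illustration of what a plugged-in system could look like, the sketch below implements the interface with a naive word-overlap lookup. The interface is repeated so the snippet stands alone. `WordOverlapMemory`, `tokenize`, and `demo` are hypothetical names, and how a class is registered in src/run.ts is not shown because the README excerpt does not specify it.

```typescript
// Interface as described in the README (src/types.ts).
interface MemorySystem {
  readonly name: string;
  reset(): void | Promise<void>; // called before each scenario
  remember(text: string): void | Promise<void>;
  query(question: string): string | Promise<string>;
}

// Hypothetical helper: lowercase alphanumeric tokens.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Hypothetical example system: returns the remembered text that shares
// the most words with the question. It has no notion of time or entity.
class WordOverlapMemory implements MemorySystem {
  readonly name = "word-overlap-example";
  private notes: string[] = [];

  reset(): void {
    this.notes = [];
  }

  // Async on purpose, to show that async methods satisfy the interface.
  async remember(text: string): Promise<void> {
    this.notes.push(text);
  }

  async query(question: string): Promise<string> {
    const wanted = new Set(tokenize(question));
    let best = "";
    let bestScore = 0;
    for (const note of this.notes) {
      const score = tokenize(note).filter((w) => wanted.has(w)).length;
      if (score > bestScore) {
        best = note;
        bestScore = score;
      }
    }
    return best;
  }
}

async function demo(): Promise<string> {
  const memory = new WordOverlapMemory();
  memory.reset();
  await memory.remember("The deploy target is staging.");
  await memory.remember("Lunch is at noon.");
  return memory.query("What is the deploy target?");
}

demo().then((answer) => console.log(answer)); // The deploy target is staging.
```

A lookup this simple has no model of time, so it would be expected to show the same kind of retraction failure described for similarity-only retrieval.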

How it works

Scenarios (src/scenarios/) are ordered scripts of remember and query events. Each query declares the substring the answer must contain and the stale substrings it must not, so leaking an out-of-date...
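The substring check described above might look roughly like the following. This is a sketch under assumptions: the event shapes, the field names `mustContain` and `mustNotContain`, and the case-insensitive comparison are all hypothetical and may differ from the types in the repository.

```typescript
// Hypothetical event shapes; the repository's real types may differ.
type RememberEvent = { kind: "remember"; text: string };
type QueryEvent = {
  kind: "query";
  question: string;
  mustContain: string;      // substring the answer must include
  mustNotContain: string[]; // stale substrings that must not appear
};
type ScenarioEvent = RememberEvent | QueryEvent;

// An answer passes only if it contains the expected substring and
// none of the stale ones. Case-insensitive matching is an assumption.
function passes(answer: string, query: QueryEvent): boolean {
  const normalized = answer.toLowerCase();
  if (!normalized.includes(query.mustContain.toLowerCase())) return false;
  return !query.mustNotContain.some((stale) =>
    normalized.includes(stale.toLowerCase())
  );
}

const finalQuery: QueryEvent = {
  kind: "query",
  question: "What is my favorite editor?",
  mustContain: "Helix",
  mustNotContain: ["Vim"],
};

// A made-up retraction scenario: the second statement updates the first.
const retractionScenario: ScenarioEvent[] = [
  { kind: "remember", text: "My favorite editor is Vim." },
  { kind: "remember", text: "I switched: my favorite editor is Helix now." },
  finalQuery,
];

console.log(retractionScenario.map((e) => e.kind).join(" -> ")); // remember -> remember -> query
console.log(passes("Helix", finalQuery));               // true
console.log(passes("Vim, or maybe Helix", finalQuery)); // false (leaks the stale value)
console.log(passes("I don't know", finalQuery));        // false (missing the current value)
```

Checking for forbidden substrings as well as the required one is what lets a hedged answer that mentions both values count as a failure.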
