Show HN: We matched full-context recall on ~1% of the tokens (open benchmark)

compresh-benchmarks/epbench/WRITEUP.md at main · compresh/compresh-benchmarks · GitHub

Fewer tokens, same recall: reconstruct context, don't resend it

Most LLM apps do the same thing every turn: they resend the entire conversation. The transcript grows linearly, each turn costs more than the last, and — past a point — the model gets worse, not better, because long context degrades (the "lost in the middle" effect, and what people now call context rot: quality drops well before the nominal window is full).
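The cost claim can be made concrete with a little arithmetic. The sketch below assumes every turn adds the same number of tokens to the transcript, which is a simplification; `resend_cost` and `tokens_per_turn` are illustrative names, not part of the benchmark. Under that assumption the input for turn k is k times the per-turn size, so the cumulative input over a session grows quadratically.

```python
def resend_cost(turns: int, tokens_per_turn: int) -> tuple[int, int]:
    """Return (last_turn_input, cumulative_input) when every turn resends the full history.

    Assumes each turn adds exactly `tokens_per_turn` tokens to the transcript.
    """
    last_turn_input = turns * tokens_per_turn
    # 1 + 2 + ... + turns = turns * (turns + 1) / 2
    cumulative_input = tokens_per_turn * turns * (turns + 1) // 2
    return last_turn_input, cumulative_input


if __name__ == "__main__":
    for n in (10, 100):
        last, total = resend_cost(n, tokens_per_turn=500)
        print(f"{n} turns: last turn sends {last:,} tokens, {total:,} in total")
```

Going from 10 to 100 turns multiplies the last-turn input by 10 but the cumulative input by roughly 92.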

Compresh takes a different path. Instead of resending the whole history, it reconstructs a query-aware slice of it each turn — the part of the past this turn actually needs. The obvious question is whether recall survives when you stop sending the whole thing. So we measured it on an independent benchmark, and we publish where it wins and where it loses.
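To illustrate what "a query-aware slice" means in general, here is a toy sketch. It is not Compresh's method, which this write-up does not describe: relevance is plain word overlap, token cost is a whitespace word count, and `select_slice` and `token_budget` are hypothetical names chosen for the example.

```python
def select_slice(history: list[str], query: str, token_budget: int) -> list[str]:
    """Pick past messages sharing the most words with the query, within a budget.

    Toy stand-in: word overlap for relevance, whitespace word count for tokens.
    """
    query_words = set(query.lower().split())
    scored = []
    for index, message in enumerate(history):
        overlap = len(query_words & set(message.lower().split()))
        if overlap > 0:
            scored.append((overlap, index, message))
    # Highest overlap first; earlier messages win ties.
    scored.sort(key=lambda item: (-item[0], item[1]))

    chosen = []
    used = 0
    for _, index, message in scored:
        cost = len(message.split())
        if used + cost <= token_budget:
            chosen.append((index, message))
            used += cost
    # Return the selected messages in their original order.
    chosen.sort()
    return [message for _, message in chosen]


if __name__ == "__main__":
    history = [
        "my dog is named Biscuit",
        "the weather was rainy on Monday",
        "Biscuit likes the park near the river",
        "I need to renew my passport",
    ]
    print(select_slice(history, "what is my dog named", token_budget=8))
```

The shape matters more than the scoring: the prompt size is bounded by the budget, not by the length of the history.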

The axis: savings × quality, not recall alone

Most agent-memory work optimizes one number: recall or accuracy. We care about a different one — how few tokens you can send while holding quality. Two measurements:

Compression. On 360 real StackExchange Q&A items, replayed as one long, growing session, our open-source core (tulbase) sent 66% fewer input tokens (40.9M → 13.9M) with no measurable quality loss (answer equivalence 87.5% vs 90.0% raw; cosine 0.667 vs 0.670).

Reconstruction (the paid memory layer, TUL 2.0). On a strong model, a single turn goes from 31,947 → 275 input tokens (−99.1%) — it sends a query-aware slice, not the conversation. (The system prompt is left untouched.)
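Both percentages follow directly from the token counts quoted above. A minimal check, where `percent_saved` is just a helper name for this example:

```python
def percent_saved(before: float, after: float) -> float:
    """Percentage of input tokens saved going from `before` to `after`."""
    return (before - after) / before * 100


compression = percent_saved(40.9e6, 13.9e6)   # StackExchange replay, tulbase
reconstruction = percent_saved(31_947, 275)   # single turn, TUL 2.0

print(f"compression: {compression:.1f}% fewer input tokens")
print(f"reconstruction: {reconstruction:.1f}% fewer input tokens")
```

The reconstruction case leaves 275 / 31,947 ≈ 0.86% of the original input, which is where the "~1% of the tokens" in the title comes from.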

Fewer tokens is easy if you don't care about answers. The point is holding quality — so here's the benchmark.

The benchmark

We used EpBench — an independent, published episodic-memory benchmark (ICLR 2025; built on Tulving's model of recall): cued questions over a long, generated book. Same answerer (gpt-5-mini) and the same judge across every arm, scored with the benchmark's own method — no home-field scoring.

| Method | Simple recall | Context read |
|---|---|---|
| raw / full context | 0.804 | 196 chapters |
| naive RAG · chapter | 0.796 | 17 chapters |
| Compresh · TUL 2.0 | 0.828 | query-aware |

The point is the juxtaposition — recall is essentially at parity while tokens are not:

EpBench · Simple Recall (paper method) · gpt-5-mini
──────────────────────────────────────────────────────
Compresh · TUL 2.0   0.828  [█████████████████░░░]  query-aware slice
raw / full context   0.804  [████████████████░░░░]  196 chapters
naive RAG · top-17   0.796  [████████████████░░░░]  17 chapters

Input tokens / turn (strong model, long chat)
──────────────────────────────────────────────────────
raw        31,947  [████████████████████]
Compresh      275  [▏░░░░░░░░░░░░░░░░░░░]  −99.1%

Compresh has the highest simple recall while reading a query-aware slice rather than the whole ~103k-token book, and it pulls further ahead on multi-event questions (full per-bin breakdown in results/). Judge caveat, stated up front: our judge was gpt-4o via OpenRouter; the paper's own judge puts raw at 0.830, about 2.6 points above the 0.804 we measured. The same judge was used for all arms.

You can reproduce the headline in ~10 seconds, no API keys: verify.py recomputes Simple Recall (the paper method — an unweighted mean over the matching-event bins) from the published per-bin recalls and checks it against the scoreboard.
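The check described above amounts to an unweighted mean plus a comparison. The sketch below shows that idea only; it is not the contents of verify.py. The function names, the tolerance, and the bin labels and values are placeholders for illustration and are not the published per-bin recalls.

```python
from statistics import fmean


def simple_recall(per_bin_recall: dict[str, float]) -> float:
    """Unweighted mean over bins: each bin counts once, whatever its question count."""
    return fmean(per_bin_recall.values())


def matches_scoreboard(per_bin_recall: dict[str, float], reported: float,
                       tolerance: float = 0.0005) -> bool:
    """True if the recomputed mean agrees with the reported score to 3 decimals."""
    return abs(simple_recall(per_bin_recall) - reported) <= tolerance


if __name__ == "__main__":
    # Placeholder values for illustration only; NOT the published per-bin recalls.
    example_bins = {"bin_a": 0.90, "bin_b": 0.80, "bin_c": 0.70}
    print(round(simple_recall(example_bins), 3))
    print(matches_scoreboard(example_bins, reported=0.800))
```

Because the mean is unweighted, a bin with few questions moves the headline as much as a bin with many, which is worth keeping in mind when reading the per-bin breakdown.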

Where it loses — and why that's the honest part

On chronological ordering, naive RAG beats us: 0.65 vs 0.44. Retrieving a query-relevant slice breaks temporal contiguity, so "put these events in order" gets harder. We publish that number next to the wins.

This isn't a confession of inferiority — it's the nature of the field. Every approach here...
