Show HN: We matched full-context recall on ~1% of the tokens (open benchmark)

compresh | 1 point | 0 comments

compresh-benchmarks/epbench/WRITEUP.md at main · compresh/compresh-benchmarks · GitHub


Fewer tokens, same recall: reconstruct context, don't resend it

Most LLM apps do the same thing every turn: they resend the entire conversation. The transcript grows linearly, each turn costs more than the last, and past a point the model gets worse, not better, because long context degrades (the "lost in the middle" effect, and what people now call context rot: quality drops well before the nominal window is full).

Compresh takes a different path. Instead of resending the whole history, it reconstructs a query-aware slice of it each turn: the part of the past this turn actually needs. The obvious question is whether recall survives when you stop sending the whole thing. So we measured it on an independent benchmark, and we publish where it wins and where it loses.
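To make "query-aware slice" concrete, here is a deliberately naive sketch. It is not Compresh's algorithm, which this write-up does not describe. It assumes a plain word-overlap score and a whitespace-style token count, and the names `build_slice` and `budget` are hypothetical, chosen only for illustration.

```python
import re


def tokens(text: str) -> list[str]:
    """Crude tokenizer: lowercase alphanumeric runs (an assumption, not a real LLM tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_slice(history: list[str], query: str, budget: int) -> list[str]:
    """Pick the past turns that share the most words with the query, within a token budget."""
    query_words = set(tokens(query))
    scored = []
    for idx, turn in enumerate(history):
        overlap = len(query_words & set(tokens(turn)))
        if overlap:
            scored.append((overlap, idx))
    # Highest overlap first; ties go to the more recent turn.
    scored.sort(key=lambda pair: (-pair[0], -pair[1]))

    chosen, used = [], 0
    for _, idx in scored:
        cost = len(tokens(history[idx]))
        if used + cost <= budget:
            chosen.append(idx)
            used += cost
    # Emit the selected turns in their original order.
    return [history[i] for i in sorted(chosen)]


history = [
    "User: my deploy target is Kubernetes on GKE",
    "User: I prefer tabs over spaces",
    "User: the staging database is Postgres 15",
    "User: lunch was great today",
]
result = build_slice(history, "Which database does staging use?", budget=20)
print(result)
```

Only the turn that mentions the staging database is sent; the other three turns cost nothing on this query. A real system would need a far better relevance signal than word overlap.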

The axis: savings × quality, not recall alone

Most agent-memory work optimizes one number: recall or accuracy. We care about a different one: how few tokens you can send while holding quality. Two measurements:

Compression. On 360 real StackExchange Q&A items, replayed as one long, growing session, our open-source core (tulbase) sent 66% fewer input tokens (40.9M → 13.9M) with no measurable quality loss (answer equivalence 87.5% vs 90.0% raw; cosine 0.667 vs 0.670).

Reconstruction (the paid memory layer, TUL 2.0). On a strong model, a single turn goes from 31,947 → 275 input tokens (−99.1%): it sends a query-aware slice, not the conversation. (The system prompt is left untouched.)
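The two savings figures above follow directly from the token counts quoted. A minimal check, where `token_savings` is a hypothetical helper name and the inputs are the numbers stated in this write-up:

```python
def token_savings(raw_tokens: float, sent_tokens: float) -> float:
    """Fraction of input tokens saved relative to the raw baseline."""
    if raw_tokens <= 0:
        raise ValueError("raw_tokens must be positive")
    return 1.0 - sent_tokens / raw_tokens


# Compression: 40.9M -> 13.9M input tokens over the replayed session.
compression = token_savings(40.9e6, 13.9e6)
# Reconstruction: 31,947 -> 275 input tokens for a single turn.
reconstruction = token_savings(31_947, 275)

print(f"compression:    {compression:.1%}")
print(f"reconstruction: {reconstruction:.1%}")
```

The second figure is where the "~1% of the tokens" in the title comes from: 275 / 31,947 is a little under 0.9% of the raw input.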

Fewer tokens is easy if you don't care about answers. The point is holding quality — so here's the benchmark.

The benchmark

We used EpBench, an independent, published episodic-memory benchmark (ICLR 2025; built on Tulving's model of recall): cued questions over a long, generated book. Same answerer (gpt-5-mini) and the same judge across every arm, scored with the benchmark's own method, with no home-field scoring.

| Method | Simple recall | Context read |
|---|---|---|
| raw / full context | 0.804 | 196 chapters |
| naive RAG · chapter | 0.796 | 17 chapters |
| Compresh · TUL 2.0 | 0.828 | query-aware |

The point is the juxtaposition — recall is essentially at parity while tokens are not:

EpBench · Simple Recall (paper method) · gpt-5-mini
──────────────────────────────────────────────────────
Compresh · TUL 2.0   0.828 [█████████████████░░░] query-aware slice
raw / full context   0.804 [████████████████░░░░] 196 chapters
naive RAG · top-17   0.796 [████████████████░░░░] 17 chapters

Input tokens / turn (strong model, long chat)
──────────────────────────────────────────────────────
raw        31,947 [████████████████████]
Compresh      275 [▏░░░░░░░░░░░░░░░░░░░] −99.1%

Compresh has the highest simple recall while reading a query-aware slice, not the whole ~103k-token book, and pulls further ahead on multi-event questions (full per-bin breakdown in results/). Judge caveat, stated up front: our judge was OpenRouter gpt-4o; the paper's own judge puts raw at 0.830, within ~3 points of our 0.804. Same judge for all arms.

You can reproduce the headline in ~10 seconds, no API keys: verify.py recomputes Simple Recall (the paper method: an unweighted mean over the matching-event bins) from the published per-bin recalls and checks it against the scoreboard.
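The check described above can be sketched in a few lines. This is not the contents of verify.py. The bin names and recall values below are placeholders rather than the published per-bin numbers, and `simple_recall`, `matches_scoreboard`, and the tolerance are assumptions made for illustration.

```python
from statistics import fmean


def simple_recall(per_bin_recall: dict[str, float]) -> float:
    """Unweighted mean over bins: every bin counts equally, however many questions it holds."""
    if not per_bin_recall:
        raise ValueError("need at least one bin")
    return fmean(per_bin_recall.values())


def matches_scoreboard(per_bin_recall: dict[str, float], reported: float,
                       tol: float = 0.0005) -> bool:
    """True if the recomputed score agrees with the reported one to three decimals."""
    return abs(simple_recall(per_bin_recall) - reported) <= tol


# Placeholder values only, NOT the benchmark's published per-bin recalls.
placeholder_bins = {"bin_a": 0.90, "bin_b": 0.80, "bin_c": 0.70, "bin_d": 0.60}

print(round(simple_recall(placeholder_bins), 3))
print(matches_scoreboard(placeholder_bins, reported=0.750))
```

Because the mean is unweighted, a bin with few questions moves the headline as much as a bin with many, which is worth keeping in mind when reading the per-bin breakdown.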

Where it loses — and why that's the honest part

On chronological ordering, naive RAG beats us: 0.65 vs 0.44. Retrieving a query-relevant slice breaks temporal contiguity, so "put these events in order" gets harder. We publish that number next to the wins.

This isn't a confession of inferiority — it's the nature of the field. Every approach here...
