Fewer tokens, same recall: reconstruct context, don't resend it
Most LLM apps do the same thing every turn: they resend the entire conversation. The transcript grows linearly, each turn costs more than the last, and, past a point, the model gets worse, not better, because long context degrades (the "lost in the middle" effect, and what people now call context rot: quality drops well before the nominal window is full).
Compresh takes a different path. Instead of resending the whole history, it reconstructs a query-aware slice of it each turn: the part of the past this turn actually needs. The obvious question is whether recall survives when you stop sending the whole thing. So we measured it on an independent benchmark, and we publish where it wins and where it loses.
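The cost argument above can be sketched in a few lines. This is a toy model, not a measurement: it assumes every turn adds the same number of tokens to the transcript, and the function names and example numbers are hypothetical.

```python
def resend_cost(tokens_per_turn: int, turns: int) -> int:
    """Total input tokens when each turn resends the whole transcript so far.

    Turn t sends t * tokens_per_turn tokens, so the total grows quadratically.
    """
    return sum(tokens_per_turn * t for t in range(1, turns + 1))


def slice_cost(slice_tokens: int, turns: int) -> int:
    """Total input tokens when each turn sends a fixed-size slice instead."""
    return slice_tokens * turns


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show the shape of the curves.
    for n in (10, 100):
        print(n, resend_cost(500, n), slice_cost(300, n))
```

Per-turn cost grows linearly when resending, so the cumulative cost grows with the square of the number of turns; a fixed-size slice keeps the cumulative cost linear.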
The axis: savings × quality, not recall alone
Most agent-memory work optimizes one number: recall or accuracy. We care about a different one: how few tokens you can send while holding quality. Two measurements:
Compression. On 360 real StackExchange Q&A items, replayed as one long, growing session, our open-source core (tulbase) sent 66% fewer input tokens (40.9M → 13.9M) with no measurable quality loss (answer equivalence 87.5% vs 90.0% raw; cosine 0.667 vs 0.670).
Reconstruction (the paid memory layer, TUL 2.0). On a strong model, a single turn goes from 31,947 → 275 input tokens (−99.1%): it sends a query-aware slice, not the conversation. (The system prompt is left untouched.)
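Both savings figures follow from the before and after token counts quoted above. A minimal check, where `reduction_pct` is a helper name introduced here for illustration:

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage of input tokens saved going from `before` to `after`."""
    return (before - after) / before * 100.0


if __name__ == "__main__":
    # Compression: 40.9M -> 13.9M input tokens
    print(f"{reduction_pct(40.9e6, 13.9e6):.1f}%")  # 66.0%
    # Reconstruction: 31,947 -> 275 input tokens for a single turn
    print(f"{reduction_pct(31_947, 275):.1f}%")     # 99.1%
```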
Fewer tokens is easy if you don't care about answers. The point is holding quality — so here's the benchmark.
The benchmark
We used EpBench, an independent, published episodic-memory benchmark (ICLR 2025; built on Tulving's model of recall): cued questions over a long, generated book. Same answerer (gpt-5-mini) and the same judge across every arm, scored with the benchmark's own method, with no home-field scoring.
| Method | Simple recall | Context read |
| --- | --- | --- |
| raw / full context | 0.804 | 196 chapters |
| naive RAG · chapter | 0.796 | 17 chapters |
| Compresh · TUL 2.0 | 0.828 | query-aware |
The point is the juxtaposition — recall is essentially at parity while tokens are not:
    EpBench · Simple Recall (paper method) · gpt-5-mini
    ──────────────────────────────────────────────────────
    Compresh · TUL 2.0    0.828  [█████████████████░░░]  query-aware slice
    raw / full context    0.804  [████████████████░░░░]  196 chapters
    naive RAG · top-17    0.796  [████████████████░░░░]  17 chapters

    Input tokens / turn (strong model, long chat)
    ──────────────────────────────────────────────────────
    raw        31,947  [████████████████████]
    Compresh      275  [▏░░░░░░░░░░░░░░░░░░░]  −99.1%
Compresh has the highest simple recall while reading a query-aware slice, not the whole ~103k-token book, and pulls further ahead on multi-event questions (full per-bin breakdown in results/). Judge caveat, stated up front: our judge was OpenRouter gpt-4o; the paper's own judge puts raw at 0.830, about 2.6 points above our 0.804. Same judge for all arms.
You can reproduce the headline in ~10 seconds, no API keys: verify.py recomputes Simple Recall (the paper method: an unweighted mean over the matching-event bins) from the published per-bin recalls and checks it against the scoreboard.
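The check described above amounts to an unweighted mean plus a tolerance comparison. The sketch below shows that idea only; it is not the contents of verify.py. The function names, the tolerance, and the bin values are placeholders invented for illustration, not the published per-bin recalls.

```python
from statistics import fmean


def simple_recall(per_bin_recall: dict[str, float]) -> float:
    """Unweighted mean over bins: every bin counts equally,
    regardless of how many questions fall into it."""
    return fmean(per_bin_recall.values())


def matches_scoreboard(per_bin_recall: dict[str, float],
                       published: float,
                       tol: float = 0.0005) -> bool:
    """True if the recomputed mean agrees with the published figure
    to within `tol` (an assumed tolerance for three-decimal rounding)."""
    return abs(simple_recall(per_bin_recall) - published) <= tol


if __name__ == "__main__":
    # Placeholder values, NOT the benchmark's real per-bin recalls.
    example_bins = {"bin_a": 0.90, "bin_b": 0.80, "bin_c": 0.70, "bin_d": 0.60}
    print(round(simple_recall(example_bins), 3))
    print(matches_scoreboard(example_bins, published=0.75))
```

Because the mean is unweighted, a bin with few questions moves the headline as much as a bin with many, which is why the per-bin numbers need to be published alongside it.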
Where it loses — and why that's the honest part
On chronological ordering, naive RAG beats us: 0.65 vs 0.44. Retrieving a query-relevant slice breaks temporal contiguity, so "put these events in order" gets harder. We publish that number next to the wins.
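A toy illustration of why relevance-based selection can lose temporal contiguity. This is generic top-k ranking over made-up scores, not Compresh's actual selection logic, and it does not model the gap between the two arms; the function names are hypothetical.

```python
def top_k_chapters(scores: list[float], k: int) -> list[int]:
    """Indices of the k highest-scoring chapters, in relevance order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def is_contiguous(indices: list[int]) -> bool:
    """True if the chosen chapters form one unbroken run of the book."""
    ordered = sorted(indices)
    return all(b - a == 1 for a, b in zip(ordered, ordered[1:]))


if __name__ == "__main__":
    # Hypothetical relevance scores for chapters 0..6 of a book.
    scores = [0.1, 0.7, 0.2, 0.9, 0.05, 0.8, 0.3]
    picked = top_k_chapters(scores, 3)
    print(picked)                 # [3, 5, 1]
    print(is_contiguous(picked))  # False
```

Sorting the picked indices restores their order, but the chapters in between are still missing, so events that happened in the gaps cannot be placed.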
This isn't a confession of inferiority — it's the nature of the field. Every approach here...