THOL — Token-Harness Optimizer Leaderboard
An open, reproducible, end-to-end benchmark of tools that claim to cut the token cost of coding agents. We measure the only thing that matters: did the task succeed, and what did the whole session actually cost, with and without each optimizer? Every run uses the Claude Sonnet model, with 10 runs per task per tool on a single pinned Claude Code version.
What this table shows. How much each tool changes the cost of a coding session versus plain Claude Code, on long sessions only: tasks where vanilla Claude Code burns more than 200,000 tokens. A positive number means cheaper; a negative number means more expensive.
Why long sessions: most real work with a coding agent is long and multi-step. Short throwaway tasks aren't representative: every tool lands on the same near-zero cost there, so only fixed overhead shows.
All runs use the Claude Sonnet model. With more budget we'd like to push this to multi-million-token sessions; for now this is the regime we can afford to measure, and we add new tools in batches as budget allows.
How to read this
Cost reduction = how much cheaper an optimizer makes a session versus vanilla Claude Code, in end-to-end USD, aggregated across tasks by geometric mean (so no single task dominates). Positive = cheaper, negative = more expensive, 0% = no difference; the control (vanilla Claude Code) sits at the 0% line.
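As a rough illustration of the geometric-mean aggregation, the sketch below assumes that each task contributes one ratio (tool cost divided by vanilla cost, each averaged over that task's runs) and that the reported figure is one minus the geometric mean of those ratios. That exact formula, the function name `cost_reduction`, and the per-task dollar amounts are assumptions made for the example; they are not taken from the THOL harness or from results.json.

```python
import math


def cost_reduction(vanilla_usd, tool_usd):
    """Return the geometric-mean cost reduction as a fraction (0.10 = 10% cheaper).

    Both arguments map a task name to that task's mean end-to-end cost in USD.
    Assumed formula: 1 - geometric mean of (tool cost / vanilla cost) per task.
    """
    ratios = [tool_usd[task] / vanilla_usd[task] for task in vanilla_usd]
    # Geometric mean computed in log space: exp(mean(log(ratio))).
    log_mean = sum(math.log(r) for r in ratios) / len(ratios)
    return 1.0 - math.exp(log_mean)


# Hypothetical per-task costs, for illustration only.
vanilla = {"task_a": 2.00, "task_b": 8.00}
tool = {"task_a": 1.00, "task_b": 16.00}

reduction = cost_reduction(vanilla, tool)
print(f"cost reduction: {reduction * 100:.1f}%")
```

With these made-up numbers the tool halves the cost of one task and doubles the cost of the other, so the geometric mean reports no net change. Summing raw dollars instead (17 USD versus 10 USD) would report the tool as 70% more expensive, driven almost entirely by the larger task.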
Adoption = share of runs in which the agent actually invoked the optimizer's tools; a tool the agent never calls cannot save anything. It's shown as N/A for tools the agent doesn't explicitly call: rtk acts via an automatic hook, and the prompt/context tools (lean-ctx, claude-token-efficient) are just text. tokenade also has a hook, but it additionally exposes CLI functions (map, skeleton, query, exec…) that the agent calls by hand, so its adoption counts those.
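The adoption figure can be sketched the same way. The snippet below assumes each run is available as a list of the tool names the agent invoked, and counts a run as adopted if at least one of the optimizer's tools appears in it. The function name `adoption_rate`, the record shape, and the optimizer tool names `opt_search` and `opt_map` are hypothetical and do not describe the real results.json layout.

```python
def adoption_rate(runs, optimizer_tools):
    """Share of runs with at least one call to any of the optimizer's tools.

    `runs` is a list of runs; each run is a list of invoked tool names.
    Returns None when there are no runs, since the share is undefined.
    """
    if not runs:
        return None
    adopted = sum(
        1 for calls in runs if any(name in optimizer_tools for name in calls)
    )
    return adopted / len(runs)


# Hypothetical runs: two of the four call one of the optimizer's tools.
runs = [
    ["Read", "Bash"],
    ["Read", "opt_search"],
    ["opt_search", "opt_map"],
    ["Edit"],
]
print(adoption_rate(runs, {"opt_search", "opt_map"}))  # 0.5
```

A run with several optimizer calls still counts once, which matches the "share of runs" wording rather than a share of calls.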
Every raw per-run measurement is in results.json for your own analysis.
What we found so far
Used as documented, most of these tools don't beat plain Claude Code, and several make it more expensive. The dominant reason is adoption: the agent simply doesn't call the optimizer's tools often enough for them to pay back their own overhead. A tool that adds a CLI, an MCP server or a prompt the agent ignores is pure cost.
Where any tool does help, it's on long, expensive sessions (see the session-size split below), never on the short ones. New tools are added over time as token budget allows; this benchmark is expensive to run, so the board grows in batches rather than all at once.
Why most tools do poorly — despite great compression numbers
Almost every tool here advertises an impressive reduction (−58%, −90%, even −99%). Those numbers are usually real but narrow: they measure one of the tool's own functions in isolation (its compressor run on a fixed blob, or its search run on a fixed query) under tightly controlled conditions. That is not how an agent behaves on a real task. In real conditions, the following effects, which an isolated compression benchmark never sees, usually erase the saving or reverse it:
Adoption is hard, and mis-adoption backfires. Getting a model to actually call an MCP/CLI tool is difficult; getting a net win is harder still, because the tool has to be used in the right context. When it isn't, the agent gets a poor result, falls back to its normal way of working (re-reading, re-searching), and learns the tool is unhelpful, so it stops calling it. You pay the tokens for the failed call and lose future adoption. (We even tried forcing adoption with a verbose system prompt; it still cost more. See the GSP experiment below.)
Lossy output compression makes the agent re-fetch. For tools that compress command/CLI output, if the compaction drops the bytes the model actually needed, the model simply re-runs the command, often bypassing the tool, to recover them. Net result: more turns, more tokens, the opposite of the advertised saving.
Overhead is paid on every turn. Adding an MCP server or a system prompt injects tokens at the start of every conversation, and they are re-counted on every agent turn. Most of these are cache reads, billed at a reduced rate, but still billed. Over a long session this standing cost quietly accumulates, and for a tool the agent rarely uses it is pure loss.
Some tools break the context cache. A proxy like Headroom rewrites the growing conversation history on every turn, so the cached prefix no longer matches byte-for-byte. That forces the model to re-read the whole context as fresh input, billed at the full rate instead of the cached rate, which is ~10× cheaper. The few tokens its compression saves are dwarfed by the cache it destroys.
Together, this is why a tool can headline "−90% tokens" and still make a real session more expensive end to end, which is exactly what THOL measures.
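The standing-overhead and cache-break effects described above can be put into rough numbers with a deliberately simplified cost model. In this sketch, every turn adds a fixed number of tokens to the history and re-reads everything before it; cache writes and output tokens are ignored. The two prices are placeholders chosen only to reproduce the ~10× ratio between full-rate and cached input, and the token counts, turn count, and function names are likewise assumptions for illustration, not measurements from this benchmark.

```python
FULL_USD_PER_MTOK = 3.00    # assumed full-rate input price (placeholder)
CACHED_USD_PER_MTOK = 0.30  # assumed cached-read price, 10x cheaper (placeholder)


def standing_overhead_usd(overhead_tokens, turns, usd_per_mtok=CACHED_USD_PER_MTOK):
    """Cost of a fixed prefix (tool schemas, system prompt) re-read on every turn."""
    return overhead_tokens * turns * usd_per_mtok / 1_000_000


def history_read_usd(tokens_added_per_turn, turns, usd_per_mtok):
    """Cost of re-reading a linearly growing history on every turn.

    Turn t reads t * tokens_added_per_turn tokens, so the session total is
    tokens_added_per_turn * turns * (turns + 1) / 2.
    """
    total_tokens = tokens_added_per_turn * turns * (turns + 1) // 2
    return total_tokens * usd_per_mtok / 1_000_000


# Hypothetical session: 100 turns, 2,000 new tokens per turn,
# plus a 5,000-token optimizer prefix the agent never uses.
overhead = standing_overhead_usd(overhead_tokens=5_000, turns=100)
cached = history_read_usd(2_000, 100, CACHED_USD_PER_MTOK)
uncached = history_read_usd(2_000, 100, FULL_USD_PER_MTOK)

print(f"unused prefix, cached reads:  ${overhead:.2f}")
print(f"history with cache intact:    ${cached:.2f}")
print(f"history with cache broken:    ${uncached:.2f}")
```

Under these assumptions the unused prefix adds about $0.15 to the session, while losing the cache moves the history-reading bill from about $3.03 to about $30.30. A compressor would have to remove roughly 90% of the history just to get back to the cached baseline, which is the sense in which the destroyed cache dwarfs the tokens saved.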
Method & reproduction
Each run is one fully...