Challenging Caveman

THOL — Token-Harness Optimizer Leaderboard

An open, reproducible, end-to-end benchmark of tools that claim to cut the token cost of coding agents. We measure the only thing that matters: did the task succeed, and what did the whole session actually cost, with and without each optimizer. Every run uses the Claude Sonnet model, with 10 runs per task per tool on a single pinned Claude Code version.

What this table shows. How much each tool changes the cost of a coding session versus plain Claude Code, on long sessions only: tasks where vanilla Claude Code burns more than 200,000 tokens. A positive number means cheaper; a negative number means more expensive.

Why long sessions: most real work with a coding agent is long and multi-step; short throwaway tasks aren't representative — every tool lands on the same near-zero cost there, so only fixed overhead shows.

All runs use the Claude Sonnet model. With more budget we'd like to push this to multi-million-token sessions; for now this is the regime we can afford to measure, and we add new tools in batches as budget allows.

How to read this

Cost reduction = how much cheaper an optimizer makes a session versus vanilla Claude Code, in end-to-end USD, geometric-mean-aggregated across tasks (so no single task dominates). Positive = cheaper, negative = more expensive; 0% = no difference, with the control (vanilla Claude Code) sitting at the 0% line.
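As a rough illustration of the aggregation described above, the sketch below computes a geometric-mean cost reduction from per-task costs. The function name, the input shape (task name mapped to a per-task session cost in USD) and the numbers are assumptions made for this example; the exact procedure used to produce the leaderboard is not shown here and may differ.

```python
import math

def cost_reduction(baseline_usd, optimized_usd):
    """Geometric-mean cost reduction across tasks, as a fraction.

    Both arguments map a task name to a session cost in USD
    (hypothetical input shape). Positive = cheaper than baseline,
    negative = more expensive, 0.0 = no difference.
    """
    tasks = sorted(set(baseline_usd) & set(optimized_usd))
    if not tasks:
        raise ValueError("no tasks in common")
    # Average the per-task cost ratios in log space (geometric mean).
    log_ratios = [math.log(optimized_usd[t] / baseline_usd[t]) for t in tasks]
    geo_mean_ratio = math.exp(sum(log_ratios) / len(log_ratios))
    return 1.0 - geo_mean_ratio

# Made-up costs: the optimizer halves the cheap task and doubles the expensive one.
baseline = {"task_a": 2.00, "task_b": 8.00}
optimized = {"task_a": 1.00, "task_b": 16.00}

# Ratios 0.5 and 2.0 cancel in the geometric mean, so the reduction is about 0.
reduction = cost_reduction(baseline, optimized)
print(f"{reduction:+.1%}")
```

Summing raw dollars instead would let the expensive task dominate (17 USD against 10 USD in this made-up case), which is the effect the geometric mean is meant to avoid.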

Adoption = share of runs in which the agent actually invoked the optimizer's tools — a tool the agent never calls cannot save anything. It's shown as N/A for tools the agent doesn't explicitly call: rtk acts via an automatic hook, and prompt/context ones (lean-ctx, claude-token-efficient) are just text. (tokenade also has a hook, but it additionally exposes CLI functions — map, skeleton, query, exec… — the agent calls by hand, so its adoption counts those.)
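The adoption figure can be sketched the same way. This assumes each run is recorded as a list of the tool-call names the agent made, which is a hypothetical log shape and not necessarily the layout of results.json; the tool names in the example are likewise only illustrative.

```python
def adoption_rate(runs, optimizer_tools):
    """Share of runs with at least one call to any of the optimizer's tools.

    runs: list of runs, each a list of tool-call names (hypothetical shape).
    optimizer_tools: set of names the optimizer exposes to the agent.
    Returns None (shown as N/A) when there is nothing the agent could call.
    """
    if not runs or not optimizer_tools:
        return None
    adopted = sum(
        1 for calls in runs if any(name in optimizer_tools for name in calls)
    )
    return adopted / len(runs)

# Four made-up runs; the agent used the optimizer's tools in two of them.
runs = [["Read", "map"], ["Read", "Edit"], ["Bash", "query"], ["Edit"]]
rate = adoption_rate(runs, {"map", "skeleton", "query"})
print(rate)
```

A hook-only or prompt-only tool has no callable tools, so passing an empty set returns None, matching the N/A shown for those tools.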

Every raw per-run measurement is in results.json for your own analysis.

What we found so far

Used as documented, most of these tools don't beat plain Claude Code, and several make it more expensive. The dominant reason is adoption: the agent simply doesn't call the optimizer's tools often enough for them to pay back their own overhead. A tool that adds a CLI, an MCP server or a prompt the agent ignores is pure cost.

Where any tool does help, it's on long, expensive sessions (see the session-size split below) — never on the short ones. New tools are added over time as token budget allows; this benchmark is expensive to run, so the board grows in batches rather than all at once.

Why most tools do poorly — despite great compression numbers

Almost every tool here advertises an impressive reduction (−58%, −90%, even −99%). Those numbers are usually real but narrow: they measure one of the tool's own functions in isolation (its compressor run on a fixed blob, or its search run on a fixed query) under tightly controlled conditions. That is not how an agent behaves on a real task. In real conditions, four things an isolated compression benchmark never sees usually erase the saving or reverse it:

Adoption is hard, and mis-adoption backfires. Getting a model to actually call an MCP/CLI tool is difficult; getting a net win is harder still, because the tool has to be used in the right context. When it isn't, the agent gets a poor result, falls back to its normal way of working (re-reading, re-searching), and learns the tool is unhelpful, so it stops calling it. You pay the tokens for the failed call and lose future adoption. (We even tried forcing adoption with a verbose system prompt; it still cost more. See the GSP experiment below.)

Lossy output compression makes the agent re-fetch. For tools that compress command/CLI output, if the compaction drops the bytes the model actually needed, the model simply re-runs the command — often bypassing the tool — to recover them. Net result: more turns, more tokens, the opposite of the advertised saving.

Overhead is paid on every turn. Adding an MCP server or a system prompt injects tokens at the start of every conversation, and they are re-counted on every agent turn. Most of these are cache reads, billed at a reduced rate, but still billed. Over a long session this standing cost quietly accumulates, and for a tool the agent rarely uses it is pure loss.

Some tools break the context cache. A proxy like Headroom rewrites the growing conversation history on every turn, so the cached prefix no longer matches byte-for-byte. That forces the model to re-read the whole context as fresh input, billed at the full rate instead of the cached rate, which is ~10× cheaper. The few tokens its compression saves are dwarfed by the cache it destroys.
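A back-of-the-envelope sketch of the cache-breaking effect, under loudly stated assumptions: the prices are illustrative placeholders chosen only so that the cached rate is 10× cheaper than the full rate, the context sizes and the 20% compression are invented, and cache-write and output-token costs are ignored.

```python
def context_read_cost_usd(context_tokens_per_turn, usd_per_mtok):
    """Cost of reading the conversation prefix once per turn at a given rate."""
    return sum(context_tokens_per_turn) * usd_per_mtok / 1_000_000

# Illustrative prices (assumptions), in USD per million input tokens.
FULL_RATE = 3.00
CACHED_RATE = FULL_RATE / 10  # the ~10x gap described above

# Made-up context size on three consecutive turns of a growing session.
context = [50_000, 60_000, 70_000]

# A hypothetical proxy shrinks the context by 20% but invalidates the cache,
# so every turn is billed at the full rate.
compressed = [n * 8 // 10 for n in context]

cached_cost = context_read_cost_usd(context, CACHED_RATE)
uncached_compressed_cost = context_read_cost_usd(compressed, FULL_RATE)

print(f"cache intact:          ${cached_cost:.3f}")
print(f"compressed, no cache:  ${uncached_compressed_cost:.3f}")
```

With these numbers the compressed but uncached session reads fewer tokens and still costs several times more, because the per-token rate rises by far more than the token count falls.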

Together, this is why a tool can headline "−90% tokens" and still make a real session more expensive end to end — which is exactly what THOL measures.
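The standing-overhead argument can be put into rough numbers as well. Everything in this sketch is an assumption for illustration: the overhead size, the turn count, the tokens saved and both prices are invented, and the model ignores cache writes, output tokens and any extra turns caused by re-fetching.

```python
def standing_overhead_usd(overhead_tokens, turns, cache_read_usd_per_mtok):
    """Cost of re-reading a fixed prompt/MCP overhead from cache on every turn."""
    return overhead_tokens * turns * cache_read_usd_per_mtok / 1_000_000

def net_saving_usd(tokens_saved, input_usd_per_mtok, overhead_usd):
    """Value of the input tokens the tool saved, minus its standing overhead."""
    return tokens_saved * input_usd_per_mtok / 1_000_000 - overhead_usd

# Hypothetical tool: 5,000 tokens of tool definitions, re-read over 200 turns
# at an assumed cache-read price of 0.30 USD per million tokens.
overhead = standing_overhead_usd(5_000, 200, 0.30)

# The agent rarely calls it, so it saves only 50,000 input tokens in total,
# valued at an assumed 3.00 USD per million tokens.
net = net_saving_usd(50_000, 3.00, overhead)

print(f"overhead: ${overhead:.2f}, net saving: ${net:+.2f}")
```

In this made-up case the overhead outweighs the saving, so the session ends up more expensive even though the tool did reduce tokens on the calls where it was used.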

Method & reproduction

Each run is one fully...
