Evaluate Your Agentic Tooling | Peter Baumgartner
Status: WIP
tl;dr: Evaluate all your agentic tools in realistic end-to-end agentic tasks. Claims about token reduction from tools don’t transfer from experimental conditions to all agentic workflows. At the time of writing, agents also prefer their own tools and workflows, and users should not expect tools to have their intended effect without additional usage enforcement.
Intro
Organizations are entering the tokenmaxxing hangover stage now. And with that, lots of tooling is popping up claiming to reduce token usage.
I’m not an organization, but I’d love to use fewer tokens as well! As a Claude Pro subscriber I’d love to get more than 1.5 Opus 4.8 runs out of my session limit.
To that end, I built an evaluation harness and ran an experiment with 4 tools I know of that claim to reduce token usage.
The Experiment
The Tools (Interventions)
To introduce the tools evaluated, here are the claims on token reduction from each of them:
A Claude Code skill/plugin (also Codex, Gemini, Cursor, Windsurf, Cline, Copilot, 30+ more) that makes agent talk like caveman — cuts ~75% of output tokens, keeps full technical accuracy.
- caveman
High-performance CLI proxy that reduces LLM token consumption by 60-90%
- rtk
Uses ~98% fewer tokens than grep+read
- semble
Evaluated across 31 real-world repositories: 83% answer quality, 10× fewer tokens, 2.1× fewer tool calls vs. file-by-file exploration.
- codebase-memory-mcp
I called each of these tools Interventions as they take various forms: caveman is a skill, rtk is a CLI proxy, semble and codebase-memory-mcp are both MCP & CLI.
The Experimental Arms
Each intervention also provides its own standard install procedure for making the intervention available to a coding agent. These install procedures also differ between Claude Code and Codex.
To help standardize their usage across agents and tasks, I created 3 experimental arms for each intervention (where applicable):
standard - the intervention and instructions are installed as close as possible to the intervention’s default instructions
steered - a standard installation, with additional text added to the user-level AGENTS.md/CLAUDE.md steering the agent to use the intervention
enforced (codebase-memory-mcp and semble) - the steered installation, plus removal of the typical tools used for the task (i.e. disallowing Grep / Bash(rg))
In addition to these arms, I included a frugal prompt that instructs the agent to complete the task in as few tool calls / turns as possible.
The Models and Task
I evaluated gpt-5.5 with Codex along with claude-opus-4-8 and claude-opus-4-6 in Claude Code. The default thinking/effort settings were used for all models.
Each agent completed the same 5 tasks, selected from SWE Bench Pro, once with each tool and once as a baseline. I selected 5 tasks that were ‘complex’ as measured by the number of files touched and lines of code changed in the patch. The 5 tasks spanned the ansible, openlibrary, and qutebrowser repositories.
This resulted in 60 experiments per model: (5 tasks) x (12 interventions: baseline, frugal, caveman x2, rtk x2, semble x3, codebase-memory-mcp x3)
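The arithmetic behind that count can be sketched by enumerating the grid. This is a minimal illustration, not the actual harness: the task names are placeholders (the post does not list task IDs), and the `tool--arm` labels simply mirror the naming used in the results table.

```python
from itertools import product

# Placeholder task names: the 5 SWE Bench Pro task IDs are not given in the post.
TASKS = [f"task_{i}" for i in range(1, 6)]

# Arms per intervention as described above; "enforced" exists only for
# semble and codebase-memory-mcp.
ARMS = {
    "caveman": ["standard", "steered"],
    "rtk": ["standard", "steered"],
    "semble": ["standard", "steered", "enforced"],
    "codebase-memory-mcp": ["standard", "steered", "enforced"],
}


def build_conditions():
    """Return the conditions run for one model: baseline, frugal, and every tool arm."""
    conditions = ["baseline", "frugal"]
    for tool, arms in ARMS.items():
        for arm in arms:
            # Bare tool name for the standard arm, "tool--arm" otherwise.
            conditions.append(tool if arm == "standard" else f"{tool}--{arm}")
    return conditions


def build_grid():
    """Return every (task, condition) pair for one model."""
    return list(product(TASKS, build_conditions()))


if __name__ == "__main__":
    print(len(build_conditions()), "conditions,", len(build_grid()), "runs per model")
```

Enumerating the grid up front makes it easy to check that no arm was silently skipped before any tokens are spent.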
Results
No intervention demonstrated robust token savings at the task level across all models.
To measure this, we use the ratio of the effective intervention token cost to the effective reference token cost. The “effective” token cost here includes cache writes and cache reads.
Standard arms compare to baseline; steered/enforced arms compare to frugal. Lower than 1.0x is cheaper.
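As a rough sketch of that metric: the code below sums the four token classes and divides intervention by reference. The `Usage` fields, the equal weighting of token classes, and the example numbers are all assumptions for illustration; the post does not say how cache writes and reads are weighted, and treating frugal itself as compared against baseline is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Token counts for one run (hypothetical field names)."""
    input_tokens: int = 0
    output_tokens: int = 0
    cache_write_tokens: int = 0
    cache_read_tokens: int = 0


# Assumption: every token class counts equally. Swap in price-based weights
# if "effective" should reflect billing rather than raw volume.
DEFAULT_WEIGHTS = {"input": 1.0, "output": 1.0, "cache_write": 1.0, "cache_read": 1.0}


def effective_tokens(usage, weights=DEFAULT_WEIGHTS):
    """Weighted sum of all token classes, including cache writes and reads."""
    return (
        weights["input"] * usage.input_tokens
        + weights["output"] * usage.output_tokens
        + weights["cache_write"] * usage.cache_write_tokens
        + weights["cache_read"] * usage.cache_read_tokens
    )


def reference_for(condition):
    """Steered/enforced arms compare to frugal; everything else to baseline."""
    if condition.endswith(("--steered", "--enforced")):
        return "frugal"
    return "baseline"


def cost_ratio(intervention, reference, weights=DEFAULT_WEIGHTS):
    """Below 1.0 means the intervention was cheaper than its reference."""
    return effective_tokens(intervention, weights) / effective_tokens(reference, weights)


if __name__ == "__main__":
    # Made-up numbers, only to show the shape of the calculation.
    reference = Usage(1000, 100, 1900, 7000)
    intervention = Usage(900, 50, 1800, 6500)
    print(f"{cost_ratio(intervention, reference):.3f}x vs {reference_for('rtk--steered')}")
```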
| intervention / arm | Opus 4.6 | Opus 4.8 | GPT-5.5 | pooled |
| --- | --- | --- | --- | --- |
| frugal | 0.932x | 1.064x | 0.761x | 0.910x |
| caveman | 1.045x | 0.934x | 0.804x | 0.923x |
| semble | 1.000x | 1.001x | 0.913x | 0.970x |
| codebase-memory-mcp--steered | 0.981x | 1.050x | 0.934x | 0.987x |
| codebase-memory-mcp | 1.205x | 1.170x | 0.862x | 1.067x |
| caveman--steered | 0.941x | 0.994x | 1.303x | 1.068x |
| rtk--steered | 0.900x | 1.188x | 1.535x | 1.180x |
| rtk | 1.208x | 1.419x | 1.097x | 1.234x |
| codebase-memory-mcp--enforced | 1.351x | 1.281x | 1.124x | 1.248x |
| semble--steered | 1.254x | 1.214x | 1.306x | 1.257x |
| semble--enforced | 1.446x | 1.377x | 1.544x | 1.454x |
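The pooled column is not defined in the text. One reading that matches the numbers to within rounding is the geometric mean of the three per-model ratios, which is a common way to average ratios. The check below is my inference from the table, not a confirmed description of how the column was computed; the rows are copied from the table above.

```python
from statistics import geometric_mean

# (Opus 4.6, Opus 4.8, GPT-5.5) ratios and the reported pooled value,
# copied from a few rows of the results table.
ROWS = {
    "frugal": ((0.932, 1.064, 0.761), 0.910),
    "caveman": ((1.045, 0.934, 0.804), 0.923),
    "rtk--steered": ((0.900, 1.188, 1.535), 1.180),
    "rtk": ((1.208, 1.419, 1.097), 1.234),
    "semble--enforced": ((1.446, 1.377, 1.544), 1.454),
}


def pooled_gap(ratios, reported):
    """Absolute difference between the geometric mean and the reported pooled value."""
    return abs(geometric_mean(ratios) - reported)


if __name__ == "__main__":
    for name, (ratios, reported) in ROWS.items():
        print(f"{name}: geomean={geometric_mean(ratios):.3f} reported={reported:.3f}")
```

An arithmetic mean does not fit as well: for rtk--steered it gives roughly 1.208x against the reported 1.180x.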
So our ‘best’ intervention on average is frugal at ~9% savings, a prompt that GPT-5.5 itself wrote, which is:
Minimize file reading. Search narrowly before reading. Never re-read unchanged files. Keep outputs terse. Prefer targeted inspection over broad exploration, and open files only when the next edit or decision requires exact local context. Avoid broad listing unless narrow search fails.
Interpretation & Learnings
Some figures here will be helpful:
Output is a small percentage of the total token cost, so something like caveman has a low ceiling of possible impact. Output tokens are ~1% of billed tokens, so even a large reduction in output is still a tiny reduction overall.
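To put a number on that ceiling, here is a back-of-the-envelope sketch. It assumes everything except output volume stays constant, which is a simplification, since shorter outputs also shrink the context that later turns re-read. The ~1% share is the figure above and the ~75% cut is caveman's own claim.

```python
def max_total_savings(output_share, output_reduction):
    """Upper bound on the fraction of billed tokens saved when only output shrinks.

    output_share: fraction of billed tokens that are output tokens (0..1)
    output_reduction: fraction of output tokens the tool removes (0..1)
    """
    return output_share * output_reduction


if __name__ == "__main__":
    ceiling = max_total_savings(output_share=0.01, output_reduction=0.75)
    print(f"Best-case total savings: {ceiling:.2%}")
```

Under these assumptions, even a tool that fully delivers a 75% output reduction saves less than 1% of billed tokens.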
Here’s where context comes from on the baseline runs across all tasks. Most of the tokens...