0x022 – Token Optimizers


Token Optimizers ✂️ Explained

New format experiment. I want to ship more often, so I'm trying shorter issues: same research, fewer words, and you get the gist (and the topic) faster. Tell me if it lands; you can always just hit reply.

Uber torched its entire 2026 AI budget in four months, mostly on Claude Code. Token bills are the new cloud bills, and a pile of tools has shown up to cut them.

TL;DR (for devs watching their Claude Code bill climb) - agents stuff the context with junk, which makes them slower, pricier, and dumber. Token optimizers cut the junk before it hits the model.

How does it work? 💡

Every npm test, every git status, every time the agent re-reads the same 600-line file, the raw output gets dumped straight into context. Thousands of tokens of "all tests passed" logs and endless file paths. Token optimizers sit between the agent and the model and trim the waste before it lands: compress command output, swap repeat reads for diffs, dedupe repeated lines.

# npm test
raw: 4,000 tokens of "✓ passed" spam
optimized: "47 passed, 1 failed: auth.test.ts:23" ~40 tokens

# git status
raw: 800 tokens of file paths
optimized: "3 modified, 2 untracked" ~15 tokens

# re-reading a file
raw: 6,000 tokens (3rd time this session)
optimized: unified diff, not the whole file ~50 tokens

And it's not just the bill. Long, junk-filled context actually degrades answers. Models get lost as the input grows, well before the window is full.

Accuracy slides as input grows. Chroma's "context rot", measured across 18 frontier models.

Thoughts 🧠

✅ Why use - three wins in one: a smaller bill, faster replies (less to prefill), and often better output, since the model isn't wading through noise. The savings are real, even if the headline percentages are all self-reported.

🙅 Why not - these tools decide what counts as "junk," and one bad cut makes your agent quietly dumber without you noticing ("silently dangerous"?). You're also piping all your code, secrets in tool output included, through a third-party proxy. And the harnesses are catching up on their own: Claude Code auto-compacts and clears stale tool results, and Anthropic's context editing claims ~84% fewer tokens. Might not be needed for long.

⚔️ Players - rtk (the ~60k-star flagship, compresses dev-command output), headroom (compresses, but can fetch originals back), caveman (makes the model itself talk terser, fewer output tokens), token-optimizer (a Claude Code skill). Adjacent: TOON (token-thrifty JSON) and claude-mem (persistent memory, so it re-reads less). On the semantic side: LLMLingua (the OG token-dropper) and The Token Company (YC, the rare pure-play).

🔮 Prediction - labs make money per token, so their incentives here aren't super aligned with yours. The value probably lands inside AI gateways as a feature (LiteLLM already ships one), not a standalone company. (🔥 hot take: most pure token-optimizer startups won't make it). And it won't stay a coding thing; the same bloat hits RAG and long chat histories. Eventually a "token budget" becomes a normal ops metric. FinOps for your AI spend.

"token optimization" vs "prompt compression," past 5 years (Google Trends). Flat for years, then a sharp spike.

Further reading 📚

Lost in the Middle - the classic on why models ignore the middle of long context.
rtk + headroom, combined - the canonical stack, walked through.
ai-cost-optimization - my running list of AI cost-cutting techniques beyond tokens (caching, batching, routing, quantization).
batchata - my unified Python API for batch LLM requests (~50% off on most providers).

Thanks to @TomGranot, the sharpest tool-scout I know, who edits every issue.
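
Appendix: the three trims from "How does it work?" as a rough Python sketch. This is not how rtk, headroom, or any other tool named above is implemented. The ✓/✗ line format, the function names, and the FileReadCache class are all assumptions made up for illustration, using only the standard library.

```python
import difflib
from itertools import groupby


def summarize_test_output(raw: str) -> str:
    """Compress command output: count passes, keep only the failures.

    Assumes a hypothetical runner that prefixes each line with ✓ or ✗.
    """
    passed = 0
    failures = []
    for line in raw.splitlines():
        stripped = line.strip()
        if stripped.startswith("✓"):
            passed += 1
        elif stripped.startswith("✗"):
            failures.append(stripped.lstrip("✗ ").strip())
    summary = f"{passed} passed, {len(failures)} failed"
    if failures:
        summary += ": " + "; ".join(failures)
    return summary


def dedupe_repeated_lines(raw: str) -> str:
    """Dedupe: collapse runs of identical consecutive lines into 'line (xN)'."""
    out = []
    for line, run in groupby(raw.splitlines()):
        n = sum(1 for _ in run)
        out.append(line if n == 1 else f"{line} (x{n})")
    return "\n".join(out)


class FileReadCache:
    """Swap repeat reads for diffs: full file the first time, a diff after."""

    def __init__(self) -> None:
        self._seen = {}  # path -> content last handed to the model

    def read(self, path: str, content: str) -> str:
        previous = self._seen.get(path)
        self._seen[path] = content
        if previous is None:
            return content  # first read: nothing to diff against
        if previous == content:
            return f"{path}: unchanged since last read"
        diff = difflib.unified_diff(
            previous.splitlines(),
            content.splitlines(),
            fromfile=f"{path} (previous)",
            tofile=f"{path} (current)",
            lineterm="",
            n=1,  # one line of surrounding context per hunk
        )
        return "\n".join(diff)


if __name__ == "__main__":
    raw = "\n".join([f"✓ test {i}" for i in range(47)] + ["✗ auth.test.ts:23"])
    print(summarize_test_output(raw))  # 47 passed, 1 failed: auth.test.ts:23
```

The diff-on-reread step is where the "one bad cut" risk from the Why not section shows up: if the harness has already compacted away the first full read, a diff or an "unchanged" note leaves the model with nothing to apply it to. A real tool would need a way to hand the original back.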

Agam More
