99% compressed, 1% on the bill: I audited 1B tokens to find out why

josuramos1 pts0 comments

99% compressed, 1% on the bill — I audited 1B tokens · NUXS

Late last year I went through this firsthand: massive token burn on my team. And a paradox you might recognize. The cheaper tokens got, the higher the bill came in. Better, faster models invite heavier use, and consumption grows faster than prices fall. Budgets kept rising instead of leveling off.

I tried everything to cut costs. Eventually I looked into data-compression companies and GitHub repos, and here I had a home-field advantage: I come from building bots for prediction markets, and squeezing data is what I've done for years. To my surprise, what I found was disappointing. The tools compressed text as text, in the academic state-of-the-art style (LLMLingua and its successors): they drop the tokens with the lowest statistical weight. That works for prose. It fails on exactly what an agent eats all day: logs, SQL schemas, diffs, stack traces, test output, and API responses. Generic compression fails the same way.

To be clear about where the cost comes from: an LLM generates spend on four sides of the bill, which are input, cache write, cache read, and output. Data compression never touches the output, one of the biggest bottlenecks. And in every compressor I looked at, I ran into the same thing. They sold a high compression ratio as if it were savings, with no clear study behind it. It isn't the same thing, and that gap was exactly what I wanted to measure.
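To make the gap between compression ratio and savings concrete, here is a minimal sketch that prices a workload across the four sides of the bill. Everything in it is an assumption for illustration: the prices, the token counts, and the `reachable` parameter (the share of fresh input a compressor actually gets to see) are placeholders, not figures from the audit.

```python
from dataclasses import dataclass

@dataclass
class Usage:
    """Token counts for the four billed sides of one workload."""
    input: int
    cache_write: int
    cache_read: int
    output: int

# Hypothetical prices in USD per million tokens (placeholders, not a real price list).
PRICES = {"input": 3.00, "cache_write": 3.75, "cache_read": 0.30, "output": 15.00}

def bill(u: Usage) -> float:
    """Total cost in USD: each side is priced separately, then summed."""
    return (
        u.input * PRICES["input"]
        + u.cache_write * PRICES["cache_write"]
        + u.cache_read * PRICES["cache_read"]
        + u.output * PRICES["output"]
    ) / 1_000_000

def compress_input(u: Usage, ratio: float, reachable: float) -> Usage:
    """Apply a compression ratio to the reachable share of fresh input only.

    Cache traffic and output are left untouched: a compressor that only
    sees part of the input cannot move the other sides of the bill.
    """
    saved = round(u.input * reachable * ratio)
    return Usage(u.input - saved, u.cache_write, u.cache_read, u.output)

def savings_pct(before: Usage, after: Usage) -> float:
    """Percentage of the bill removed by going from `before` to `after`."""
    return 100.0 * (1.0 - bill(after) / bill(before))

# Made-up workload: mostly cache reads, with a meaningful output side.
before = Usage(input=1_000_000, cache_write=2_000_000,
               cache_read=50_000_000, output=1_500_000)
after = compress_input(before, ratio=0.99, reachable=0.10)
print(f"bill before: ${bill(before):.2f}, saved: {savings_pct(before, after):.2f}%")
```

With these placeholder numbers, a 99% ratio applied to a tenth of the fresh input removes well under 1% of the bill, because the compressed slice was a small part of the spend to begin with.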

Tired of all this, I started working on an architecture of my own. The idea was that the real driver of savings isn't compressing more, but correctly mapping and tracking what passes through, with savings ceilings that depend on how each person uses it. My compression reaches a 99.9% margin. But how much of that actually comes off my bill? That was the question that pulled the rest along.

The intuition behind it is simple. Good mapping gives the AI only what it needs to keep working, meaning the fewest tokens in the end. And it isn't only money. It gets faster and more accurate, because the model's attention is finite and a clean context goes further.

But compressing input wasn't enough. I had to look at the output too.

§ 01 How this became engineering

Structured data isn't prose. A log is a pattern that repeats with variations; a schema is a shape, not a sentence. So instead of a generic compressor, I wrote 20 specialized parsers (the capsules and the multimodals), each with its own opinion about what's signal and what's noise in its format.
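As an illustration of "a pattern that repeats with variations", here is a minimal sketch of a log parser that masks the variable parts of each line and counts the templates that remain. It is not one of the capsules described here: `template_of`, `log_digest`, and the masking rules are assumptions made up for this example.

```python
import re
from collections import Counter

# Masks for the parts of a log line that usually vary (illustrative, not exhaustive).
# Order matters: timestamps are masked before the generic digit rule can split them.
_MASKS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?"), "<ts>"),
    (re.compile(r"\b[0-9a-f]{8,}\b"), "<hex>"),
    (re.compile(r"\d+"), "<n>"),
]

def template_of(line: str) -> str:
    """Replace variable fields with placeholders so repeated lines collapse."""
    for pattern, token in _MASKS:
        line = pattern.sub(token, line)
    return line.strip()

def log_digest(text: str) -> str:
    """Return one line per template with its count, most frequent first."""
    counts = Counter(template_of(l) for l in text.splitlines() if l.strip())
    return "\n".join(f"{n}x {tpl}" for tpl, n in counts.most_common())

sample = """\
2024-01-01T10:00:00Z INFO request 17 served in 12ms
2024-01-01T10:00:01Z INFO request 18 served in 9ms
2024-01-01T10:00:02Z ERROR db timeout after 3000ms
"""
digest = log_digest(sample)
print(digest)
```

The digest keeps the shape of the log and throws away the per-line values, so a sketch like this only makes sense if the raw lines stay retrievable somewhere else.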

But the engineering that matters isn't in the capsules on their own. It's in the three-layer engine on top of them. Each layer reaches savings the previous one can't, and they work like a filter, in this order:

Layer 1, Capsule (runs first). 17 specialist capsules plus 3 multimodals. When the data matches a capsule's type, the capsule compresses it better than any generic compressor while preserving structure, with a margin of 87 to 95%. What it doesn't recognize, it lets through on purpose. In specialist use (hundreds of PDFs, or a shopping agent on the multimodal capsules), the capsule layer alone already cuts cost heavily within its parameters and hands the rest off to what comes next.

Layer 2, Squeeze (runs next, on what's left). It intercepts the traffic that had no dedicated capsule and keeps the reference recoverable, so the agent restores the data when it needs to. What the capsule already compressed doesn't pass through Squeeze again. The result: coverage rises from ~46% (from the capsules) to as much as 84% (the ceiling reached in the study), and effective margin reaches 80.8%, which is the number actually saved.

Layer 3, Economy (reaches the output). Capsule and Squeeze act on input only. The output is the side of the bill no compression touches. Economy generates substantial savings there through smart routing between models, with the exact amount varying by task and usage profile.
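The order of the three layers can be sketched as a small dispatch pipeline. This is a toy under stated assumptions, not the engine itself: the capsule registry, the type detector, the `squeeze://` reference format, the in-memory store, and the model names and routing rule are all hypothetical.

```python
import hashlib
from typing import Callable, Optional

# Layer 1: hypothetical capsule registry, keyed by detected data type.
CAPSULES: dict[str, Callable[[str], str]] = {
    "log": lambda body: f"[log digest: {len(body.splitlines())} lines]",
}

# Layer 2: in-memory store standing in for whatever keeps raw bodies recoverable.
_STORE: dict[str, str] = {}

def detect_type(body: str) -> Optional[str]:
    """Toy detector: a body is a log if every non-empty line has a level keyword."""
    levels = ("INFO", "WARN", "ERROR", "DEBUG")
    lines = [l for l in body.splitlines() if l.strip()]
    if lines and all(any(lv in l for lv in levels) for l in lines):
        return "log"
    return None

def squeeze(body: str, preview_chars: int = 40) -> str:
    """Replace a body with a short preview plus a reference to the raw text."""
    key = hashlib.sha256(body.encode("utf-8")).hexdigest()[:16]
    _STORE[key] = body
    return f"{body[:preview_chars]}... [squeeze://{key}]"

def retrieve(key: str) -> str:
    """Return the original body exactly as it was stored."""
    return _STORE[key]

def compress_input(body: str) -> tuple[str, str]:
    """Run Layer 1 first; only what no capsule claims falls through to Layer 2."""
    kind = detect_type(body)
    if kind in CAPSULES:
        return "capsule", CAPSULES[kind](body)
    return "squeeze", squeeze(body)

def route_model(task: str) -> str:
    """Layer 3: send routine generation to a cheaper model (names are placeholders)."""
    routine = {"summarize", "format", "classify"}
    return "small-model" if task in routine else "large-model"

layer_a, out_a = compress_input("INFO started\nERROR failed")
layer_b, out_b = compress_input("x" * 500)
print(layer_a, out_a)
print(layer_b, out_b)
```

The early return in `compress_input` is what keeps capsule output from passing through Squeeze a second time. `route_model` is kept separate because it acts on generation, not on the input path.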

Architecture: the three-layer engine

Agent · Hook

Layer 1 · Capsule: runs first

Algorithmic (11): log · api · network · schema · codebase · diff · test · build · apispec · prompt · image. Deterministic. Zero marginal cost. Runs locally.

LLM-based: rag · sql · stack · threads · events · pdf. Query-aware abstractive digest (RECOMP) on the user's own keys.

Multimodal: image-LLM · meeting · video. Measured in a separate phase with real provider calls.

Dense index: always delivered to the agent.

Raw body: one retrieve away, byte-exact.

(what slips past the capsules)

Layer 2 · Squeeze: runs on what's left

Intercepts traffic with no dedicated capsule. Maps the conversation between turns, expels cold context, leaves a recoverable reference. What a capsule already compressed doesn't re-enter Squeeze.

Coverage up to 84% · 80.8% effective · 99.1%

(input handled · output remains)

Layer 3 · Economy: reaches the output

Capsule and Squeeze act on input only. Economy routes generation to a cheaper model on the user's own keys, and closes the only side that was left open: the...
