The Speedometer I Made Up: Building a Token Budget Governor for AI Coding Agents | Ryan Duffy - Building with AI
Go back<br>The Speedometer I Made Up: Building a Token Budget Governor for AI Coding Agents<br>23 Jun, 2026<br>· 18 min read |<br>I run Claude Code and Codex hard — multiple sessions, long hauls. On the Max 20× plan that’s comfortable. But I wanted to know if I could drop to 5×, and that meant trusting the guardrails I’d spent eleven days building. So I audited them.
The audit found something I didn’t expect: my budget governor had been faithfully protecting me against a speed limit measured by a speedometer I’d built myself. This is how the guards evolved — and how each one had already mutated away from its naive first version.
The ramp that forced this
I didn’t build a budget governor because I enjoy writing hooks. I built it because the curve got scary. Here’s five and a half months of local usage (from Jan 9, via ccusage), by month:
Monthraw tokgenerated (in+out)notional cost*January (from the 9th)93 M0.6 M$61February1.6 B2.3 M$1,375March3.1 B7.2 M$2,472April2.1 B7.2 M$1,895May7.4 B36 M$5,896June (to the 23rd) 11.5 B 161 M $11,223<br>* Notional — ccusage’s pricing-table estimate, not a bill; on a subscription the marginal cost is ~$0. Raw tokens are ~98% cache reads, so the generated column (input+output) is the honest “real work” signal.
The shape is the whole reason this post exists. Usage didn’t drift upward — it ramped ~5× from spring into June : with the month only two-thirds gone, June’s generated work alone (161 M ) already outran all five prior months combined (~53 M). Real generated work went from ~7 M/month in spring to 161 M in June . Everything below — the weighted proxy, the warn-only pacing, the Codex routing, the deterministic guards — was built on the steepest part of that curve, not in calm water. The governor was a response, not a premonition. (Same caveats as everywhere here: single machine, notional dollars, and the data only starts Jan 9 — there’s no local transcript before that.)
The core method: don’t count tokens, weight them
Every version rests on one decision. The weekly cap isn’t about raw token volume — different token types cost wildly different amounts. So the guard scores spend with a price-weighted proxy, not a counter:
WEIGHTS = {<br>"input_tokens": 1.0,<br>"cache_creation_input_tokens": 1.25,<br>"cache_read_input_tokens": 0.1, # cache reads are cheap…<br>"output_tokens": 5.0, # …output is 5x input<br># then scaled per-model: opus 1.0 (baseline), sonnet 0.6, haiku 0.2, fable 2.0<br>That 0.1 on cache reads looks innocent. Hold onto it — it’s the whole reason loops are dangerous, and we’ll come back to it.
Here’s the honest caveat, lifted straight from my own code comment:
Anthropic does not document how /usage computes the weekly %, so the price ratio is a defensible proxy for subscription-cap consumption, not a first-party-confirmed weighting.
I built a governor on a number I reverse-engineered. Keep that in mind too.
v0 → v1: the guard that punished me for being early
The first version watched the weighted spend and, when I was burning too fast, force-compacted my context mid-session to save tokens. It worked and it was awful — because ahead of pace isn’t over budget. Front-load a heavy Monday and the naive projection screams “you’ll blow the cap!” and compacts you, even though you’ll coast all week.
The fix was to make the guard forward-looking and warn-only (condensed):
# projected end-of-cycle spend at the RECENT burn rate, not cumulative.<br># An idle/frugal day lowers `rate` -> projection drops -> the nag relaxes itself.<br>projected = spent + recent_rate * time_left<br>PROJ_BANDS = [ # (projected fraction of cap, compaction ceiling)<br>(1.3, 250_000), # on track to bust cap by >30%<br>(1.1, 350_000),<br>(1.0, 450_000),<br># Only a genuine near-lockout (>=95% ACTUAL spend) ever hard-clamps.<br># Being on-trajectory-to-bust is WARN-ONLY.<br>The thinking: a guard that blocks you costs more in broken flow than the overspend it prevents. The mature version advises, and lets the projection relax on its own when you behave. First law of guard evolution: block → advise.
The first time the meter lied (overcounting)
While building this I found the proxy running ~2.4× hot . Claude Code’s transcript logs repeat usage entries on session resume, compaction, and sidechain replay — I measured ~58% duplicates . The guard was counting the same tokens two or three times and panicking accordingly. The fix was a dedup key per assistant message:
key = f"{msg['id']}|{entry.get('requestId')}"<br>if key in seen: # duplicate replay — already counted this cycle<br>continue<br>Lesson banked: before you trust a guardrail, verify the number it reads. I’d need that lesson again.
v2 → v3: stop guarding, start routing
Codex joined the workflow, so “my budget” became two budgets with different reset windows. The guard grew into cross-vendor governance — an architecture decision record, a Codex-side rate-limit mirror, and a tier lever...