Measuring tech debt in tokens instead of engineering hours

Tech debt as a token function | heysoup blog

← Back to notes

We used to measure tech debt in "development hours."

That was always a noisy metric. An engineer's hour includes standup, sprint planning, 1:1s, Slack threads, opening Jira, reading PR comments, figuring out what they were doing before the last meeting. The actual cognitive work — reasoning about the deferred decision — is a fraction of that hour. Maybe half. Maybe a third. You're measuring debt through a foggy lens, and the fog moves independently of the debt.

Coding agents changed something fundamental here, and I don't think the industry has fully absorbed it yet.

Agents don't go to meetings

An agent doesn't have a calendar. It doesn't context-switch between three repos. It doesn't check Slack before writing code. An agent's entire existence is throughput. Every token it spends is a token on the problem.

This inverts the tech debt analysis completely:

Before: "This deferred decision cost us 3 engineering days." What does that mean? 3 days of calendar time? 24 billable hours? 8 hours of actual reasoning plus 16 hours of meetings and recovery? The unit is soft. You can't graph it cleanly.

After: "This deferred decision cost us 6,500 tokens across three agent sessions." That's a hard number. You can sum it, average it, trend it. Every token was spent on one thing: re-deriving context that was already decided.

The duality of tokens

Tokens have a dual representation that makes them fundamentally different from hours as a unit:

Cost: N thousand tokens × price per token = a direct P&L line. Graphable against revenue, burn, margin.

Throughput: N thousand tokens = latency impact, context window pressure, agent cycles that could have been spent on new work instead of re-derivation. Also graphable.

Hours only had one axis. You could count them, but the conversion to actual cost required assumptions about productivity, meeting load, focus time — all the fog. Tokens give you two axes for the same data point, and both are clean.

A concrete exercise

Pick one deferred decision. Say: an auth spec change that was deprioritized and left without documentation. Imagine what happens across three agent sessions that encounter the gap:

Session one sees the inconsistency and spends 1,500 tokens re-deriving why the current approach exists, checking git blame, reading related files, forming a hypothesis.

Session two encounters the same gap and spends 2,000 tokens working around it — a different workaround, because the reasoning from session one wasn't captured anywhere.

Session three now has two conflicting workarounds AND the original gap. It spends 3,000 tokens reconciling the mess.

6,500 tokens total. One deferred decision. Every token was a direct, unavoidable cost of not capturing intent at decision time.

Now multiply by N decisions across M sessions. The function isn't linear — unreconciled workarounds compound. Each new session that encounters conflicting patches spends more tokens than the last one did.

The instrument at decision time

The cost of capturing the decision at the moment it was made is trivially small:

We chose bcrypt over argon2 because the team already knows the bcrypt API surface and argon2 adds audit complexity we can't review right now.

That's ~30 tokens. A single purpose string. Written at the moment of the decision.

Every agent session after that reads that string and moves on. Zero re-derivation tokens.

The balance sheet: 30 tokens in, 1,500+ tokens out per session. Positive after the first encounter. Massively positive after the tenth.

The bet

There's been a lot of discussion around token costs lately. Nearly all of it answers "what did we burn?" None of it answers "what should we measure before we defer?"

Tools that capture intent at decision time — purpose strings, instant import of context on session start, a portable record of reasoning that follows the work — fill that gap. Not because they're elegant. Because the numbers are undeniable when the unit is clean.

"Three days of dev time" is fuzzy. "Six thousand tokens of deferred decision" is a number you can put on a dashboard. And the fix — a 30-token purpose string at the moment of choice — costs less than this paragraph.

We used to measure debt in a unit that included standups. Now we measure it in tokens. The instrument is a sentence written before you move to the next thing.

That's the shift. Capture at decision time is what the graph demands when the unit of cost stops being an hour and starts being a thousand tokens.

Measuring tech debt in tokens instead of engineering hours

Related Articles

Claude Fable 5

US Government directive to suspend access to Fable 5 and Mythos 5

Is AI ruining our skills? Early results are in – and they're not good

The Anatomy of an AI-Native Org

Apertus – Open Foundation Model for Sovereign AI