Never Waste a Token




15 June 2026

durable inference: resumable streams, crash recovery, and why the LLM request shouldn't die with your process.

(this post itself is LLM slop, but it tastes alright)

tl;dr - put a durable buffer between your agent and the LLM provider. the provider connection now outlives your process, so a deploy in the middle of a stream doesn’t cost you the tokens you already paid for. and the same buffer that lets a disconnected browser catch back up is the thing that recovers a crashed turn. one log, two readers.

I’ve spent the last few weeks stuck on one question: what happens to an agent when the process running it dies in the middle of a turn?

it goes deep fast. tool calls that may or may not have fired. sub-agents. half-written streams waiting on a human. I’m writing all of that up separately (durable agent loops, coming soon). but one piece of it is small and self-contained enough to pull out on its own:

when your process dies mid-inference, you don’t just lose your place. you lose money.

the problem that’s easy to miss

your agent opens a streaming request to a model, and the model starts generating. you’re billed for those output tokens the moment they’re generated. then your process gets replaced. maybe a deploy, maybe an eviction, maybe an OOM.

the usual reassurance is “don’t worry, the state is durable.” and sure, your conversation history survived. but the in-flight HTTP request to the provider did not. it lived in the memory of the process that just died. so when you recover, your only option is to make the call again. you pay for those output tokens a second time.

now make it an agent. a real one does multiple tool calls in a single turn:

user message
→ stream some text
→ tool call → tool result
→ stream more text
→ tool call → tool result
→ stream the answer

every interruption throws away all the output tokens generated so far in that turn. and it scales with the model you actually want to use: output runs $30 per million tokens on gpt-5.5 versus $2 on gpt-5.5-mini, so a flagship retry burns ~15x what a mini one does. the better the model, the more it hurts. deploys happen constantly, evictions happen constantly, and each one that lands on a live stream is money straight out the window.
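to put a number on it, here is a back-of-the-envelope sketch of the arithmetic. the two prices are the ones quoted above; the 4,000-token turn and the function name `wastedUsd` are made up for illustration, not measurements.

```typescript
// per-million output-token prices quoted in the text above.
const PRICE_PER_MILLION_USD = {
  "gpt-5.5": 30,
  "gpt-5.5-mini": 2,
} as const;

type Model = keyof typeof PRICE_PER_MILLION_USD;

// cost of output tokens that were generated and billed, then thrown away
// because the turn had to be re-run from the start.
function wastedUsd(model: Model, tokensGeneratedBeforeCrash: number): number {
  return (tokensGeneratedBeforeCrash / 1e6) * PRICE_PER_MILLION_USD[model];
}

// hypothetical turn: 4,000 output tokens in when the process is replaced.
const flagship = wastedUsd("gpt-5.5", 4000);
const mini = wastedUsd("gpt-5.5-mini", 4000);
console.log(`flagship: $${flagship.toFixed(3)}, mini: $${mini.toFixed(3)}`);
```

twelve cents versus under a cent looks small for one turn. multiply it by every live stream that a deploy interrupts and it stops being small.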

the happy path hides it. you only see it when you start counting tokens after an incident and the numbers don’t add up.

the move: stop tying the request to the process

the reason a crash wastes tokens is that the provider connection lives inside the thing that crashed. so move it out.

put a buffer between the agent and the provider, and make it a separate deployment: its own Worker, its own Durable Object.

when a request comes in, the buffer does three things in order. it resets its state for a fresh stream. it kicks off a background task that drains the provider connection into SQLite. and it immediately hands the caller back a stream that tails those same rows as they land:

    async proxyAndBuffer(req: ProviderRequest): Promise<Response> {
      this.resetBuffer(); // status = "streaming", chunkCount = 0
      const reader = (await fetch(req.url, req)).body!.getReader();

      // drain the provider in the background. deliberately NOT awaited - the
      // response below returns right away while this keeps running.
      this.keepAliveWhile(() => this.consumeProvider(reader));

      // give the caller a stream that tails the rows as they're written.
      return new Response(this.tailFrom(0), {
        headers: { "X-Buffer-Status": "streaming" }
      });
    }

    private async consumeProvider(reader: Reader) {
      for (let i = 0; ; i++) {
        const { done, value } = await reader.read();
        if (done) break;
        // one row per chunk, keyed by its position in the stream.
        this.sql`INSERT INTO buffer_chunks VALUES (${i}, ${decode(value)})`;
        this.notify(); // wake any tailers (more below)
      }
      this.setStatus("completed");
    }

the load-bearing part is what consumeProvider is not attached to. it doesn’t run inside the agent. it runs here, in a separate deployment that wasn’t touched by the agent’s deploy. so when the agent gets evicted mid-stream and its tail connection is cancelled, the drain loop keeps reading. the tokens you paid for keep landing in SQLite, whether or not anyone’s listening.
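a toy version makes the decoupling easier to see. this is a sketch under loud assumptions, not the real buffer: `ToyBuffer`, `fakeProvider` and `demo` are hypothetical names, an in-memory array stands in for the SQLite table, and an async generator stands in for the provider stream. the point it shows is only that the drain loop and the tailer share rows, not a lifetime.

```typescript
// in-memory stand-in: `chunks` plays the role of the buffer_chunks table.
class ToyBuffer {
  chunks: string[] = [];
  status: "streaming" | "completed" = "streaming";
  private waiters: Array<() => void> = [];

  // drain loop: owns the provider stream and knows nothing about readers.
  async drain(provider: AsyncIterable<string>): Promise<void> {
    for await (const chunk of provider) {
      this.chunks.push(chunk);
      this.notify();
    }
    this.status = "completed";
    this.notify();
  }

  // wake every tailer that is currently waiting for new rows.
  private notify(): void {
    const waiting = this.waiters;
    this.waiters = [];
    for (const wake of waiting) wake();
  }

  // tailer: yields stored chunks starting at `from`, sleeping until notified.
  async *tail(from: number): AsyncGenerator<string> {
    let next = from;
    while (true) {
      while (next < this.chunks.length) yield this.chunks[next++];
      if (this.status === "completed") return;
      await new Promise<void>((resolve) => this.waiters.push(resolve));
    }
  }
}

// hypothetical provider: five chunks, one every millisecond or so.
async function* fakeProvider(): AsyncGenerator<string> {
  for (const token of ["a", "b", "c", "d", "e"]) {
    await new Promise((resolve) => setTimeout(resolve, 1));
    yield token;
  }
}

async function demo() {
  const buf = new ToyBuffer();
  const draining = buf.drain(fakeProvider()); // started, not awaited

  const seenBeforeCrash: string[] = [];
  for await (const chunk of buf.tail(0)) {
    seenBeforeCrash.push(chunk);
    if (seenBeforeCrash.length === 2) break; // the "agent" dies here
  }

  await draining; // the drain never noticed the tailer going away

  const resumed: string[] = [];
  for await (const chunk of buf.tail(seenBeforeCrash.length)) resumed.push(chunk);
  return { seenBeforeCrash, resumed, stored: buf.chunks.length };
}
```

the first tailer quits after two chunks, the drain still stores all five, and a second tailer picks up the remaining three. nothing was generated twice.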

keepAliveWhile is what holds the buffer open while it drains. a long generation has quiet stretches, and a Durable Object can be evicted for looking idle. keepAliveWhile heartbeats an alarm for the duration of the drain and drops it the moment the task finishes or throws, so the buffer survives those gaps without leaking a heartbeat afterwards.
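the shape of that helper can be sketched in a few lines. this is an assumption about the pattern, not the actual implementation: it uses `setInterval` as a stand-in for the Durable Object alarm, and `heartbeat` and `intervalMs` are hypothetical parameters. what it does show is the part that matters, which is that the heartbeat is dropped in a `finally`, so it goes away whether the task finishes or throws.

```typescript
// hypothetical sketch: `heartbeat` stands in for "re-arm the alarm".
// a real Durable Object would use its alarm instead of setInterval.
async function keepAliveWhile<T>(
  task: () => Promise<T>,
  heartbeat: () => void,
  intervalMs = 1000,
): Promise<T> {
  heartbeat(); // mark the object as busy right away
  const timer = setInterval(heartbeat, intervalMs);
  try {
    return await task();
  } finally {
    // runs on success and on throw, so no heartbeat leaks afterwards.
    clearInterval(timer);
  }
}
```

in the buffer above the call is fired without `await`, so the response can return while the drain keeps the heartbeat going in the background.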

when the agent restarts, it calls /resume?from=N and gets the chunks it missed. no wasted tokens, no duplicate provider...
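a minimal sketch of what the resume side could look like, with assumptions flagged: `parseResumeCursor`, `chunksSince` and the `BufferedChunk` shape are hypothetical, and it assumes `N` is the index of the first chunk the agent has not seen yet. the real endpoint may define the cursor differently.

```typescript
// hypothetical row shape: one buffered chunk and its position in the stream.
interface BufferedChunk {
  idx: number;
  data: string;
}

// pull the `from` cursor out of a /resume?from=N URL.
// a missing cursor is treated as 0, i.e. replay from the start.
function parseResumeCursor(url: string): number {
  const raw = new URL(url).searchParams.get("from");
  const from = raw === null ? 0 : Number(raw);
  if (!Number.isInteger(from) || from < 0) {
    throw new RangeError(`invalid resume cursor: ${raw}`);
  }
  return from;
}

// everything the caller missed, in stream order.
function chunksSince(rows: BufferedChunk[], from: number): BufferedChunk[] {
  return rows.filter((row) => row.idx >= from).sort((a, b) => a.idx - b.idx);
}

// usage: the agent saw chunks 0 and 1 before it was evicted.
const stored: BufferedChunk[] = [
  { idx: 0, data: "never " },
  { idx: 1, data: "waste " },
  { idx: 2, data: "a " },
  { idx: 3, data: "token" },
];
const cursor = parseResumeCursor("https://buffer.example/resume?from=2");
const missed = chunksSince(stored, cursor).map((row) => row.data).join("");
```

the agent appends `missed` to what it already had and carries on. the provider is never called a second time.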

