How to cut the cost of long AI agent threads (without making the agent dumber)

peteralbert1 pts0 comments

How We Built Viktor Around Prompt Caching (80% Cheaper Agent Threads) | Viktor Blog


Key Takeaways

- Model APIs are (mostly) stateless, so agents re-send their entire history on every call. A 40-step Viktor thread transmits ~2.17M input tokens even though the transcript is only ~85K tokens long.
- Prompt caching turns re-sent tokens into 0.1x cache reads. On Claude Opus 4.8 our example thread drops from $11.35 to $2.07, an 81.8% reduction.
- Caching only works if the prefix is byte-stable, so it has to shape the whole agent architecture: tools are exposed as SDK functions in code instead of schemas in the prompt, and every thread is an append-only log.
- Summarization runs inside the thread's own cache, sending the full history as a 0.1x read instead of paying full price in a separate call.
- Compaction timing follows the cache lifecycle: never compact a hot thread, compact aggressively in the minutes before the cache goes cold.
- Every provider's cache behaves differently (explicit breakpoints vs automatic, TTLs, routing), so the thread engine adapts per provider.

Viktor is an AI employee that lives in Slack and Microsoft Teams. People hand it real work: triage a support inbox, audit a CRM pipeline, analyze a QA screen recording, build a report. A single task routinely means a thread with dozens of model calls, each one carrying the system prompt, the user's skills and memory, the conversation so far, and a growing pile of tool results.

That workload shape has a brutal cost profile if you implement it naively. This post walks through the problem, the math, and the specific architectural decisions inside Viktor's thread engine that keep frontier-model agents economically viable. Everything below comes from our production codebase, and we will keep one concrete example thread running through every calculation, priced on Claude Opus 4.8.

1. The problem: LLM APIs have no memory

The mental model many people have of a chat with a model is a phone call: an open line where you only transmit the new things you say.
The reality is closer to mailing the entire case file to a new consultant every time you have a follow-up question.

Model APIs are stateless (mostly: stateful options exist, but if you want to retain full control over what the model sees, you treat them as stateless). There is no session on the provider's side that remembers your conversation. Every single call must contain everything the model needs: the system prompt, the tool definitions, every prior user message, every assistant reply, every tool call and every tool result. When the model answers, you append its reply to your local transcript, and the next call re-sends all of it again, plus the new turn.

Every turn re-sends the entire history. The new tokens are the small part; the re-sent tokens dominate.

For a human chat with five short turns, this is irrelevant. For an agent it is the whole ballgame, because an agent loop is just a conversation with itself at machine speed: call the model, get a tool call, execute it, append the result, call the model again. Forty steps means forty full re-transmissions of an ever-growing transcript.

The cost of this grows quadratically. If your context starts at P tokens and each step appends s tokens, the total input tokens across N calls is roughly N·P + s·N²/2. Double the length of a task and the s·N²/2 term, the re-sent history, costs four times as much.

Our running example. A realistic Viktor thread: a 25,000-token stable prefix (system prompt, skills, tool definitions), 40 model calls, and each step appending ~1,500 tokens of tool calls and results. Total input transmitted across the thread: 2,170,000 tokens (40 × 25,000 = 1,000,000 for the prefix, plus 1,500 × (0 + 1 + ... + 39) = 1,170,000 for the growing history), even though the final transcript is only ~85,000 tokens long. You send the same early tokens up to 40 times.

2. Prompt caching changes the unit economics

Providers noticed that virtually all agent traffic looks like this: a long, byte-identical prefix plus a small new suffix.
So they built prompt caching: the provider keeps the processed internal state (the KV cache) of your prompt prefix for a short time, and if your next request starts with the exact same bytes, it resumes from the cached state instead of recomputing it.

The discount is dramatic. Here is the pricing for Claude Opus 4.8, the model we run for most Viktor threads:

| Token type | Price / 1M tokens | vs. regular input |
| --- | --- | --- |
| Regular input | $5.00 | 1× |
| Cache write (first time a prefix is stored) | $6.25 | 1.25× |
| Cache read (every subsequent hit) | $0.50 | 0.1× |
| Output | $25.00 | -- |

You pay a 25% premium once to write a prefix into the cache, and then every read of it costs a tenth of the normal price. In an agent loop where call N+1 re-sends everything from call N, nearly all input tokens become cache reads.

Run our example thread through both pricing modes:

| Cost component | No caching | With caching |
| --- | --- | --- |
| Input at full price (2,170,000 tok × $5/M) | $10.85 | -- |
| Cache writes (83,500 tok × $6.25/M) | -- | $0.52 |
| Cache reads (2,086,500 tok × $0.50/M) | -- | $1.04 |
| Output (20,000 tok ×...
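
The token counts and dollar figures for the running example can be checked with a short script. The sketch below is an illustration, not code from Viktor's codebase. It assumes that call k (counting from zero) sends the prefix plus k steps of history, that everything sent by the previous call is a cache hit, that each token is written to the cache exactly once, and that the cache never expires mid-thread. The names `thread_cost` and `ThreadCost` are hypothetical; the prices and the 20,000 output tokens come from the tables above.

```python
from dataclasses import dataclass

# USD per 1M tokens, copied from the pricing table in this post.
PRICE_INPUT = 5.00
PRICE_CACHE_WRITE = 6.25
PRICE_CACHE_READ = 0.50
PRICE_OUTPUT = 25.00
PER_MILLION = 1_000_000


@dataclass
class ThreadCost:
    input_tokens: int        # total input tokens sent across all calls
    cache_write_tokens: int  # tokens seen for the first time (billed 1.25x)
    cache_read_tokens: int   # tokens re-sent from earlier calls (billed 0.1x)
    uncached_usd: float
    cached_usd: float


def thread_cost(prefix: int, step: int, calls: int, output_tokens: int) -> ThreadCost:
    """Hypothetical helper: price an append-only agent thread.

    Assumption: the context only ever grows, so whatever the previous
    call sent is a byte-identical prefix of the current call.
    """
    total = writes = reads = 0
    cached = 0  # tokens already in the cache before this call
    for k in range(calls):
        context = prefix + k * step   # input size of call k
        total += context
        reads += cached               # the old prefix is a cache hit
        writes += context - cached    # only the new suffix is written
        cached = context

    output_usd = output_tokens * PRICE_OUTPUT / PER_MILLION
    uncached = total * PRICE_INPUT / PER_MILLION + output_usd
    with_cache = (
        writes * PRICE_CACHE_WRITE + reads * PRICE_CACHE_READ
    ) / PER_MILLION + output_usd
    return ThreadCost(total, writes, reads, uncached, with_cache)


cost = thread_cost(prefix=25_000, step=1_500, calls=40, output_tokens=20_000)
saving = 1 - cost.cached_usd / cost.uncached_usd

print(f"input tokens sent: {cost.input_tokens:,}")
print(f"cache writes:      {cost.cache_write_tokens:,}")
print(f"cache reads:       {cost.cache_read_tokens:,}")
print(f"no caching:        ${cost.uncached_usd:.2f}")
print(f"with caching:      ${cost.cached_usd:.2f}")
print(f"reduction:         {saving:.1%}")
```

Under these assumptions the sketch gives 2,170,000 input tokens, 83,500 cache writes and 2,086,500 cache reads, and $11.35 versus about $2.07, which matches the figures quoted in the Key Takeaways. A real thread would deviate wherever the cache expires or the prefix changes between calls.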
