Agentically optimizing LLM prompt cache TTLs for fun and profit
A case study on production objective hill climbing<br>Firetiger runs a few hundred large language model (LLM) agents in production, and prompt caching is a critical tool for managing the cost of running such a workload. Properly setting the cache time-to-live (TTL), how long a cached prefix survives before the next request pays full price again, is essential to getting the most out of prompt caching. The catch: the "right" TTL is a property of the workload, and not something you can intuit up front.<br>Case in point: we were quietly burning spend on cache writes that cost more to write than they ever saved us on reads. The relevant information was spread across a multitude of data sources and generally hard for a human to reason about, let alone methodically improve upon.<br>So! We pointed a Firetiger agent at our own telemetry and bill and gave it one objective: "drive down prompt-cache waste". From there, it hill-climbed against that goal the way every Firetiger agent works: each day it measures where the money is going, recommends config and code changes that move us toward the goal, brings humans into the loop to approve fixes, and then re-measures. Over a few weeks that took about 77% off what we were wasting on overly long TTLs and missed cache write opportunities, with a residual that continues to converge asymptotically on zero.<br>Let's dig in on how we made this 77% improvement (with commensurate dollars back in our bank account) happen.<br>A quick word about caches and TTLs<br>I've spent a chunk of my career running the Cloudflare CDN, which is another way of saying I spent a chunk of my career thinking hard about cache TTLs. A CDN is a giant distributed cache sitting in front of a lot of origin servers, and every cached object carries a TTL, or time to live: hold it for X seconds, or an hour, or a week, and when that TTL expires, go back to the origin server it came from to see if it changed.
Set the TTL too low and you send more traffic than needed to the origin and pay for the misses; set it too high and you serve something stale, or incur additional admin overhead purging stale assets.<br>Prompt caching for LLMs is a similar problem with the units changed. When you send a prompt to Claude (as an example LLM; others behave roughly similarly where prompt caching is concerned) you can mark a prefix as cacheable, and the next request that shares that prefix reads it back instead of reprocessing it, billed at a fraction of the input price. Anthropic gives you two TTLs, a default five-minute cache and an extended one-hour cache (the prompt caching docs have details).<br>Cache writes aren't free: on Sonnet a write to the five-minute cache runs 1.25x the base input price, a write to the one-hour cache runs 2x, and a read runs 0.1x. So a cache only pays for itself if you get enough reads per write to cover the write premium. On the one-hour tier that break-even is roughly one extra hit per write (a 1x write premium against 0.9x saved per read works out to about 1.1 reads), and anything below it means the cache is a tax you're paying for nothing.<br>You can't compute the right TTL<br>OK, this seems like a straightforward math problem! We know the write/read prices, we know roughly what our traffic does, and so surely we can just solve for the correct TTL, right? Alas, the realized hit rate at a given TTL depends on how we're holding the LLM, that is, on what our agent workload actually looks like.<br>And the workload is constantly changing! A new agent ships, a deploy reshuffles a system prompt, a customer onboards, and the shape of the access distribution changes.<br>You could try to build a closed-form model to predict how TTL changes will behave. It's hard, and any answer you come up with will be wrong by the next deploy, because the inputs keep changing. The honest move is to run experiments in production: flip a TTL, watch a day of telemetry, keep the change or roll it back. That's hard too, and in a slower way.
Doing it methodically across a fleet means babysitting hundreds of (deployment, agent, model) combinations on a rolling schedule, and nobody has that attention to spare. Perhaps we could use an agent to automate this!<br>The loop<br>Firetiger the company uses Firetiger the product to improve our own product.<br>We stood up an agent we call the Prompt Cache Advisor, and its objective, in plain language, is to find prompt cache waste across every deployment, work out the cheapest TTL for each agent, and show its work. It pulls logs on token usage, computes a read-to-write ratio for every (deployment, agent, model) group, prices each one against the current Anthropic and Bedrock rate cards, and proposes code changes, ranked by how much impact they'd have on cache utilization. Taking these changes into prod is then a question of a human clicking "Fix" on the issue, and "Merge" on the PR.<br>The reports are specific enough to act on. From a real run:<br>switch FT_CACHE_TTL from 5m to 1h, saves ~$1,242/mo (38%). Change: set FT_CACHE_TTL=1h in Terraform...
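To make the pricing step concrete, here is a minimal Python sketch of the arithmetic described above: the break-even reads per write for each tier, and a per-group net saving used to rank groups by waste. It is not the advisor's actual code. The `GroupUsage` record, the function names, and the sample numbers are all hypothetical, and the multipliers are just the Sonnet figures quoted earlier (1.25x, 2x, 0.1x) rather than a live rate card.

```python
from dataclasses import dataclass

# Price multipliers relative to the base input price, as quoted in the
# text for Sonnet. Real rate cards vary by model and provider.
WRITE_MULTIPLIER = {"5m": 1.25, "1h": 2.0}
READ_MULTIPLIER = 0.1


def break_even_reads_per_write(tier: str) -> float:
    """Tokens read per token written at which caching costs the same as not caching."""
    write_premium = WRITE_MULTIPLIER[tier] - 1.0  # extra paid on each write
    read_saving = 1.0 - READ_MULTIPLIER           # saved on each read
    return write_premium / read_saving


@dataclass
class GroupUsage:
    """Hypothetical record for one (deployment, agent, model) group."""
    deployment: str
    agent: str
    model: str
    tier: str                 # "5m" or "1h"
    cache_write_tokens: int
    cache_read_tokens: int

    @property
    def read_to_write_ratio(self) -> float:
        if self.cache_write_tokens == 0:
            return float("inf")
        return self.cache_read_tokens / self.cache_write_tokens


def net_saving(u: GroupUsage) -> float:
    """Saving versus no caching, in base-price input tokens (negative means waste)."""
    uncached = u.cache_write_tokens + u.cache_read_tokens
    cached = (WRITE_MULTIPLIER[u.tier] * u.cache_write_tokens
              + READ_MULTIPLIER * u.cache_read_tokens)
    return uncached - cached


def rank_by_waste(groups):
    """Order groups so the ones losing the most money come first."""
    return sorted(groups, key=net_saving)


# Made-up usage numbers, for illustration only.
groups = [
    GroupUsage("prod-a", "summarizer", "sonnet", "5m", 1_000_000, 4_000_000),
    GroupUsage("prod-a", "triage", "sonnet", "1h", 1_000_000, 500_000),
]
for g in rank_by_waste(groups):
    print(g.agent, g.tier, round(g.read_to_write_ratio, 2), round(net_saving(g)))
```

Note that this only prices usage that has already been observed. It says nothing about what the hit rate would be at a different TTL, which is exactly the part that has to be measured by experiment.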