Prompt Caching: Just do it

Andre Kreidemann

Rotterdam, March 29th 2026

Motivation

The first time I looked at Claude Code's token usage stats, one column made me stop: "cached input tokens".

Wait, cached input tokens? How do you even cache something that needs the full context every time to answer?

I wrote it off as magic and went on with my life. Which was embarrassing. I’d taken every AI and NLP course my master's offered after the ChatGPT launch and figured I had a rough grasp of how this stuff works. Apparently not.

A year later prompt caching came up at work, and this time I actually committed to understanding it. Turns out almost everyone building agents should be caching aggressively, and the security concerns that I thought were scary are, for most applications, not really concerns at all.

This article is a decision framework that reflects my opinions on that.

What this is

A decision framework for prompt caching. When to cache what in your LLM application, with the security tradeoffs at each step. I'll explain just enough about how caching works to make the reasoning make sense, and then walk through the actual decisions.

What this is not

An in-depth explanation of prompt caching. I'll just cover the basics needed for my points. Understanding prompt caching deeply, especially on a technical level, requires a much longer read. Two of my favourites: Sankalp Shubham's walkthrough and Sam Rose's ngrok post.

A guide to optimise cache hits. Maybe I will write a post about this in the future, but there are great guides on this already and the provider docs are very good nowadays.

A provider comparison. I'm not going to walk through every provider's caching implementation, pricing, and TTLs. Those might change at any point, and the security reasoning applies regardless of which provider you use, given you trust the provider enough to use it in the first place.

What even is prompt caching?

To understand prompt caching, you first need to understand what it's caching, and to understand that, you have to zoom out to how LLM inference works at a high level.

When you send a request to a model, the part of the inference that matters here can be split into two stages:

Prefill phase: The model processes all your input tokens at once, building up an internal representation of the input.

Decode phase: The model generates output tokens one at a time, each one attending to everything that came before it.

Even though the prefill phase is "just" one big batch of linear algebra, the matrix multiplications get enormous, and the attention part scales quadratically with the input length. So for longer prompts the prefill phase takes far more total compute than generating a few hundred output tokens, even though decoding is slower per token because it produces tokens one at a time. A 100k token prompt means 100k tokens' worth of computation before the model even starts responding.
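To make the quadratic part concrete, here is a back-of-the-envelope sketch. It only counts the query-key pairs that causal attention has to score during prefill (per layer and per head), and it ignores the parts of the model whose cost grows linearly with length, so treat it as an illustration of the scaling and not as a cost model. The function name `causal_attention_pairs` is my own.

```python
def causal_attention_pairs(n_tokens: int) -> int:
    """Count (query, key) pairs when each token attends to itself and all earlier tokens."""
    return n_tokens * (n_tokens + 1) // 2


short_prompt = causal_attention_pairs(1_000)
long_prompt = causal_attention_pairs(100_000)

print(short_prompt)                # 500500
print(long_prompt)                 # 5000050000
print(long_prompt / short_prompt)  # roughly 10,000x for a 100x longer prompt
```

A prompt that is 100 times longer needs roughly 10,000 times as many attention scores, which is why long prefixes are the ones worth not recomputing.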

Figure: Without caching, every request re-runs the full prefill. With caching, only the new suffix has to be computed.

This is where prompt caching enters. During the prefill phase, the model computes what are called key-value pairs (KV pairs) for each token. Without getting into the linear algebra, think of it this way: each token's KV pair encodes what that token "knows" about every token that came before it. These pairs are what the model uses during attention, where each token looks back at every earlier token to figure out what's relevant. They're the mechanism by which the model understands context.

Now, the only thing we really need to understand here for prompt caching: In a decoder-only transformer (which is what Claude, GPT, and most modern LLMs are), attention only flows forward. Token 50 attends to tokens 1-49, but tokens 1-49 don't need to know anything about token 50. That means if you've already computed the KV pairs for tokens 1-49, and a new request comes in with those same 49 tokens at the start, the cached KV pairs are still valid. Nothing in them was ever going to look forward at the new tokens, so you can reuse them as-is. That's what prompt caching is: skipping the prefill computation for the prefix of your prompt that hasn't changed.
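If you want to see this property instead of taking my word for it, here is a minimal sketch in plain Python. It is a toy: two attention layers with random weights, no MLP, no normalisation, a single head, and every name in it (`causal_layer`, `kv_pairs`, `DIM`) is made up for illustration. It is not how any provider implements inference. The point is only that two requests sharing the same first five tokens end up with identical keys and values for those five positions in every layer, no matter what comes after.

```python
import math
import random


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def causal_layer(xs, wq, wk, wv):
    """One toy attention layer; returns (outputs, keys, values)."""
    qs = [matvec(wq, x) for x in xs]
    ks = [matvec(wk, x) for x in xs]
    vs = [matvec(wv, x) for x in xs]
    outputs = []
    for i, q in enumerate(qs):
        # Causal mask: position i only looks at positions 0..i.
        scores = [sum(a * b for a, b in zip(q, ks[j])) / math.sqrt(len(q))
                  for j in range(i + 1)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        mixed = [sum(w * vs[j][d] for j, w in enumerate(weights)) / total
                 for d in range(len(q))]
        # Residual connection, so the next layer sees the attended context.
        outputs.append([x + m for x, m in zip(xs[i], mixed)])
    return outputs, ks, vs


def kv_pairs(xs, layers):
    """Run all layers and collect the (keys, values) each layer produced."""
    collected = []
    for wq, wk, wv in layers:
        xs, ks, vs = causal_layer(xs, wq, wk, wv)
        collected.append((ks, vs))
    return collected


rng = random.Random(0)
DIM = 4


def rand_vec():
    return [rng.uniform(-1, 1) for _ in range(DIM)]


def rand_mat():
    return [rand_vec() for _ in range(DIM)]


layers = [(rand_mat(), rand_mat(), rand_mat()) for _ in range(2)]
prefix = [rand_vec() for _ in range(5)]              # shared first 5 "tokens"
request_a = prefix + [rand_vec() for _ in range(3)]  # same prefix, different suffix
request_b = prefix + [rand_vec() for _ in range(3)]

kv_a = kv_pairs(request_a, layers)
kv_b = kv_pairs(request_b, layers)

prefix_kv_identical = all(
    ks_a[:5] == ks_b[:5] and vs_a[:5] == vs_b[:5]
    for (ks_a, vs_a), (ks_b, vs_b) in zip(kv_a, kv_b)
)
print(prefix_kv_identical)  # True
```

The second layer is what makes this interesting: its keys and values depend on the first layer's outputs, and those only stay identical for the prefix because the causal mask keeps later tokens out of the computation.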

This only works because caching is prefix-based. The cache checks your prompt from the very first token, moving forward sequentially. As long as the tokens match a previously cached prefix, you get a cache hit for that portion and only compute the rest. The moment the tokens diverge, the cache stops matching. This means order matters, because the exact same content rearranged in a different order is a complete cache miss.
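The matching rule can be sketched in a few lines. This is a simplified illustration, assuming the cache compares prompts token by token from the start. Real implementations differ in the details (for example, matching in larger chunks instead of single tokens), and the function name `cached_prefix_length` and the string "tokens" are placeholders of mine.

```python
def cached_prefix_length(cached_tokens, request_tokens):
    """Count how many leading tokens of the request match the cached prompt."""
    matched = 0
    for cached, incoming in zip(cached_tokens, request_tokens):
        if cached != incoming:
            break  # first divergence ends the reusable prefix
        matched += 1
    return matched


cached = ["<tools>", "<system>", "user: hi", "assistant: hello"]
next_turn = cached + ["user: what is caching?"]
reordered = ["<system>", "<tools>", "user: hi", "assistant: hello"]

print(cached_prefix_length(cached, next_turn))  # 4
print(cached_prefix_length(cached, reordered))  # 0
```

The next turn reuses all four cached entries and only the new message needs prefill. The reordered request contains the same content but diverges at the very first position, so nothing can be reused.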

Figure: Every turn keeps the earlier tokens identical; only the latest message at the end differs. A perfect fit for a prefix cache.

This is why prompt caching is particularly powerful for multi-turn conversations and agentic workflows. Think about a typical API request: it starts with tool definitions, then your system prompt, then the full message history,...
