The surprising depths of prompt caching

The surprising depths of prompt caching · opub

Tokens sure are expensive, and they probably won't get cheaper any time soon. Especially when you're pulling from a pool of donated tokens, making efficient use of what's available to you is important.

The first real lever many teams discover for significant LLM spend reduction is prompt caching. Effective caching strategies can save your team massive sums of money.

What it is, how it works, best practices: we'll get into all that. We'll also explore radix trees as a caching method, and why frontier model providers probably won't be too upfront about it if they use them. And it seems like they do.

It'll be fun, no math! We promise.

Prompt bit on caching

To start, a clean definition:

Prompt caching skips repeated re-computation of the expensive part of LLM inference (the key-value attention states) for tokens that haven't changed between requests. Instead of re-reading your system prompt from scratch on every call, the server loads the precomputed state and picks up from there. You pay full price only for what's new; cached tokens are billed at a steep discount.

If your system prompt says: "Please respond exclusively as a pirate on the cusp of scurvy", the attention states for that text, "the path" through the model, are precomputed and kept ready by the server.

That means subsequent requests already "know" the system prompt, and will "skip" to what's new: yaaarrr me treasure for a melon, ye' salty dog!
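To make "skip to what's new" concrete, here is a minimal sketch that counts how many leading tokens two requests share. The token IDs and the `shared_prefix_len` helper are made up for illustration, and real servers may only match in larger blocks or above a minimum prompt length, so treat this as a simplification.

```python
def shared_prefix_len(cached_tokens, new_tokens):
    """Count how many leading tokens two requests have in common."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n


# Hypothetical token IDs: a 5-token system prompt followed by a user message.
system_prompt = [101, 7, 42, 42, 9]
first_request = system_prompt + [300, 301]
second_request = system_prompt + [555, 556, 557]

reused = shared_prefix_len(first_request, second_request)
fresh = len(second_request) - reused
print(f"reused={reused} fresh={fresh}")  # reused=5 fresh=3
```

Only the tokens after the first mismatch need fresh work, which is why anything that changes early in the prompt throws away everything behind it.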

Tokenomics

Prompt caching is a source of significant cost savings. It reduces the number of tokens that need to be crunched.

You'll see caching information referenced by your coding harness and API responses, something like this in the case of Codex:

Token usage: total=647,414 input=615,643 (+ 7,505,280 cached) output=31,771 (reasoning 4,583)

Using OpenAI's current pricing for GPT-5.5, we see the cost benefit of cache:

| Bucket | Tokens | Rate | Cost |
|---|---|---|---|
| Input (uncached) | 615,643 | $5.00/M | $3.08 |
| Input (cached) | 7,505,280 | $0.50/M | $3.75 |
| Output (incl. reasoning) | 31,771 | $30.00/M | $0.95 |
| Total | | | $7.78 |

Not bad! In a vacuum anyways.

What would it cost without caching?

| Bucket | Tokens | Rate | Cost |
|---|---|---|---|
| All input at full price | 8,120,923 | $5.00/M | $40.60 |
| Output | 31,771 | $30.00/M | $0.95 |
| Total | | | $41.55 |

Ooof! 💰 That's some cost savings:

| | Cost |
|---|---|
| With cache | $7.78 |
| Without cache | $41.55 |
| Saved | $33.77 (81% off) |
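The figures above boil down to a few multiplications. Here is a small sketch that reproduces them from the token counts and per-million rates quoted in this article; the `dollars` helper and the variable names are ours, and each bucket is rounded to cents before summing, the way the table rows are.

```python
from decimal import Decimal, ROUND_HALF_UP

# Rates in dollars per million tokens, copied from the article's tables.
RATE_INPUT = Decimal("5.00")
RATE_CACHED = Decimal("0.50")
RATE_OUTPUT = Decimal("30.00")
MILLION = Decimal(1_000_000)


def dollars(tokens, rate_per_million):
    """Cost of one bucket, rounded to cents like the table rows."""
    cost = Decimal(tokens) * rate_per_million / MILLION
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Token counts from the Codex usage line.
uncached, cached, output = 615_643, 7_505_280, 31_771

with_cache = (
    dollars(uncached, RATE_INPUT)
    + dollars(cached, RATE_CACHED)
    + dollars(output, RATE_OUTPUT)
)
without_cache = dollars(uncached + cached, RATE_INPUT) + dollars(output, RATE_OUTPUT)
saved = without_cache - with_cache
percent_saved = round(saved / without_cache * 100)

print(f"with cache:    ${with_cache}")
print(f"without cache: ${without_cache}")
print(f"saved:         ${saved} ({percent_saved}% off)")
```

Swap in your own usage numbers and rates to see what caching is worth for your workload.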

If your application, team, or agent sends the same long system prompt, tool schema, coding instructions, policy text, or repository context thousands of times per day, you are probably paying the model provider to re-read the same tokens over and over. You don't have to do that.

For now, there's an aligned incentive between us, the developers, and the frontier providers, to reduce straight inference hits. Model providers are trying to scale to match massive demand, and we're trying to save some major dollars. Efficiency is key to all of this truly working, on both sides.

But the discount is only a part of the story. An interesting question is why that repeated work is necessary in the first place.

Models don't read

A transformer, your frontier model, does not simply read text and remember the words. During the prompt-processing step, text is split into tokens, and each token is turned into vectors. No reading involved.

Claude explaining this for the 24th time before (probably) trying to steal my rings

A token starts as a small integer ID, like:

"helpful" -> token id 15345

The model first turns that ID into a dense vector, maybe thousands of numbers wide:

15345 -> [0.0312, -0.8741, 0.2209, ...]
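That ID-to-vector step is a plain table lookup. A minimal sketch, assuming a made-up vocabulary size and a tiny vector width, with random numbers standing in for learned weights (only the token ID 15345 comes from the example above):

```python
import random

VOCAB_SIZE = 20_000  # hypothetical vocabulary size
EMBED_WIDTH = 8      # hypothetical; real models use thousands of numbers

# Random stand-ins for learned weights: one row of numbers per token ID.
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1, 1) for _ in range(EMBED_WIDTH)]
    for _ in range(VOCAB_SIZE)
]

token_id = 15345
vector = embedding_table[token_id]  # the "embedding" is just row 15345
print(len(vector), [round(x, 4) for x in vector[:3]])
```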

Then every transformer layer computes attention state for that token. For caching, the important outputs are the token's Key and Value vectors at each layer.

So one token does not become one thing. It becomes something like:

token
  layer 1: K vector + V vector
  layer 2: K vector + V vector
  layer 3: K vector + V vector
  ...
  layer N: K vector + V vector

A 2,000-token prompt is not just 2,000 wee little text fragments. It is 2,000 positions' worth of mathematical state, repeated across dozens of layers. Each K/V vector can be hundreds or thousands of numbers depending on the architecture.

Multiply that by every token in the prompt, every request, and every concurrent user, and the "same system prompt" becomes a large block of GPU memory and prefill compute. That compute is what gets - and will keep getting - costly.
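A back-of-the-envelope sketch shows how fast that adds up. The layer count, vector width, and 2 bytes per number below are assumptions picked for illustration, not the shape of any particular model, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(num_tokens, num_layers, kv_width, bytes_per_number=2):
    """Bytes needed to hold K and V for every token at every layer."""
    per_token_per_layer = 2 * kv_width * bytes_per_number  # one K plus one V vector
    return num_tokens * num_layers * per_token_per_layer


# Hypothetical model shape: 48 layers, 1,024 numbers per K or V vector.
size = kv_cache_bytes(num_tokens=2_000, num_layers=48, kv_width=1_024)
print(f"{size / 2**20:.0f} MiB for one 2,000-token prompt")  # 375 MiB
```

Under these assumptions a single 2,000-token prompt occupies hundreds of MiB of state, before counting any concurrent users.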

It's baked right into the expansion:

text tokens -> embeddings -> per-layer K/V tensors

The KV cache is the model server keeping those per-token, per-layer Key/Value tensors around.
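A toy version of "keeping those tensors around", as a sketch only: `fake_kv` is a stand-in for the real attention math, `ToyKVCache` is a hypothetical name, and entries are keyed by the exact token prefix because a position's state depends on everything before it. Real servers organize this very differently.

```python
NUM_LAYERS = 3  # hypothetical; real models use dozens


def fake_kv(token_id, position, layer):
    """Stand-in for the real K/V computation: returns two tiny 'vectors'."""
    k = [(token_id * 31 + position + layer) % 97]
    v = [(token_id * 17 + position * 3 + layer) % 89]
    return k, v


class ToyKVCache:
    """Keeps per-token, per-layer K/V state, keyed by the exact token prefix."""

    def __init__(self):
        self.store = {}
        self.computed = 0  # token positions that needed fresh work

    def prefill(self, tokens):
        state = []
        for pos in range(len(tokens)):
            prefix = tuple(tokens[: pos + 1])
            if prefix not in self.store:
                self.store[prefix] = [
                    fake_kv(tokens[pos], pos, layer) for layer in range(NUM_LAYERS)
                ]
                self.computed += 1
            state.append(self.store[prefix])
        return state


cache = ToyKVCache()
cache.prefill([101, 7, 42, 300])
first = cache.computed
cache.prefill([101, 7, 42, 555, 556])
second = cache.computed - first
print(first, second)  # 4 2
```

The second call only does fresh work for the two positions after the shared three-token prefix; everything before that is served from the stored state.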

So what do we cache?

The phrase "prompt caching" sounds like a text cache. It's something different.

By the time caching matters, the prompt has already expanded into per-token, per-layer KV state. During normal generation, the model server already keeps that state around so new output tokens can attend back to the prompt without recomputing the whole...
