Deep Dive into LLM Token Cost: How Prompt Caching Works

Deep Dive into LLM Token Cost — Blog Series Part 2: How Prompt Caching Actually Works | My Big Data World

The first post in this series, Part 1: A Real-World Case Study, ended with a single number: a 31-hour Claude Code session that cost $172.58, of which $114.98 — about 66% of the bill — was cache reads. Caching wasn’t a side effect of that session. It was the dominant cost line, by volume so large that it outweighed every other line combined. Anyone trying to reason about LLM cost without a precise mental model of how the cache works is missing the part that matters most.

This post is that mental model. It’s not a strategies post — that’s the next one. It’s the mechanics post for caching specifically: how the prefix match actually behaves on the wire, what happens on the second message of a conversation versus the first, what happens when you walk away for two days, and what really happens to a single 800-token file you read into the session at turn 5.

A note on scope. Claude is used as the worked example throughout because its cache surface is the most explicit — every read, write, and TTL has a corresponding counter in the usage block, so the mechanics can be traced line by line. Where GPT or Gemini diverge in ways that change the practical answer, you’ll find a short cross-provider callout. The third post in this series goes much further on the cross-provider story; here the goal is to get the Claude mechanics rock-solid first.

What’s in this post:

Part 1 — Three questions that reveal how caching actually works. Does every chat message use the cache? Is the full context window really sent on every turn? What happens when you resume a session two days later? The short answers expose the asymmetry that governs almost every Claude cost decision.

Part 2 — A worked example. Following a single 800-token file through its full lifecycle in a 20-turn session: arrival as fresh input, the one-turn-lagged cache write, then cache reads for the rest of the session. Three diagrams make the mechanics tangible. Once you can trace 800 tokens, you can trace anything.

The series. This is the second of three posts:

Part 1: A Real-World Case Study (previous) — the mental model, anchored in a real $172 case study.

Part 2: How Prompt Caching Actually Works (this post) — three questions and one worked example.

Part 3: Strategies and Anti-Patterns (next) — five strategies ranked by impact, the silent failures that undo them, and the cross-provider comparison.

Each post stands on its own. But if you arrived here without reading the first one, the $172 case study is the proof-by-data of why this post matters: caching reads were two-thirds of that bill, and getting your mental model wrong about caching means optimizing the wrong line for the rest of your time on Claude.

Part 1: Three Questions That Reveal How Caching Actually Works

Three questions come up almost immediately the first time someone tries to reason about a prompt cache:

When I send a chat message, does the cache actually get used?

If my context window is 53.6K tokens, is the full payload really sent every turn?

What happens if I pause and resume the session two days later?

The questions sound simple. The answers expose precisely how statelessness, the cache prefix, and the 5-minute TTL interact — and they explain why some usage patterns are cheap and others are surprisingly expensive.

Question 1 — When I send a chat message, does Claude actually use the cache?

Yes — almost certainly, and aggressively. This is what’s happening on every message you send in a tool like Claude Code.

Every time you send a message, the client makes a fresh API call that re-ships the entire conversation so far. The payload looks roughly like:

[ System prompt + tool definitions ] ← large, static, cacheable<br>[ Turn 1: your msg + assistant reply ]<br>[ Turn 2: your msg + assistant reply ]<br>...<br>[ Turn N: your newest message ] ← the only truly new bytes

The client places cache_control breakpoints to mark the static prefix as cacheable. On each new message you send:

System prompt + tool definitions → cache read at ~0.1x. (This block alone is often 10K–20K tokens.)

All prior conversation turns → also cache read at ~0.1x, as long as nothing earlier was edited.

Your newest message → fresh input at 1x.

The assistant’s reply → output at 5x, then becomes part of the cached prefix on your next turn.

So yes, the cache is working hard for you in any back-and-forth chat. The economic shape of the conversation is dominated by the small fresh tail at the end of each turn , not the large repeating prefix.

The 5-minute catch. Cache entries live for 5 minutes from the last read. If you walk away for longer than that between messages, the entry expires. Your next message no longer reads from cache — it pays to rewrite the cache from scratch. We’ll come back to this in Question 3, because it’s where the biggest hidden costs hide.

Cross-provider note. The same "every message uses the...

Deep Dive into LLM Token Cost: How Prompt Caching Works

Related Articles

The Newest Instagram "Exploit" Is the Goofiest I've Seen

It's Not Just X. It's Y

Amazon, Facebook, FBI have access to a private intelligence-sharing network

Show HN: GoPeek – open links in live mini browser windows without new tabs

Agent Memory: An Anatomy