Saving Money on Inference · Merrilin.ai Blog
The repeated-prefix problem

A normal chat asks one question and gets one answer. A reading assistant does something more expensive: it keeps dragging the same context forward. Every follow-up needs the system prompt, book metadata, retrieved passages, reader state, previous answers, and whatever the user is referring to with words like “that” or “they”.

That means the expensive part of an agentic conversation is often not the new question. It is paying for the model to reread the same prefix over and over.

We care about this because Merrilin is still a passion project we are funding ourselves. That does not mean we want to make the product timid or ration the parts that make it useful. It means the opposite: if we can stop wasting money on repeated context, we can afford more reading sessions, more experiments, and better models where they actually matter.

Figure: The problem is the prefix, not the follow-up. Each row lines up the chat turn with what the provider has to process, showing why a tiny follow-up can still carry a large hidden cost. Legend: fresh input without cache, cached-read prefix, new tail.
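To make the repeated-prefix cost concrete, here is a minimal sketch of the token arithmetic. The prefix size, the per-turn tail size, and the helper name `fresh_tokens_per_turn` are all made up for illustration, and the sketch assumes an idealised cache that hits on the entire previous prompt; real providers add their own rules and pricing.

```python
def fresh_tokens_per_turn(base_prefix, tail_sizes):
    """Return (full_prompt, fresh_with_cache) token counts for each turn.

    base_prefix: tokens present before the first question (system prompt,
                 book metadata, retrieved passages, and so on).
    tail_sizes:  tokens each turn appends (new question plus prior answer).
    """
    rows = []
    cached = 0                 # tokens an idealised prefix cache already holds
    context = base_prefix      # everything dragged forward into the next turn
    for tail in tail_sizes:
        full_prompt = context + tail
        rows.append((full_prompt, full_prompt - cached))
        cached = full_prompt   # assumption: a perfect cache hit on the next turn
        context = full_prompt
    return rows


# Hypothetical conversation: 8,000-token prefix, ten turns of 500 tokens each.
rows = fresh_tokens_per_turn(base_prefix=8_000, tail_sizes=[500] * 10)
without_cache = sum(full for full, _ in rows)   # every turn rereads everything
with_cache = sum(fresh for _, fresh in rows)    # only the new tail is fresh
print(without_cache, with_cache)                # 107500 13000
```

Under these assumptions the uncached conversation processes 107,500 input tokens as fresh input, while the cached one processes 13,000, even though both conversations contain exactly the same text.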
Prompt caching is the provider-level version of not recomputing that same prefix. The model still sees the full conversation, but the billing and compute path changes: repeated context becomes a cached read, and only the new tail of the prompt is treated as expensive fresh input.

What the model is caching

To see why that is possible, we need to look at what the model is actually caching.

Modern transformer architectures use what is known as attention to predict the next token based on a given input. Each layer of the transformer needs to compute three values, which are then used by the attention block.

For an input sequence packed into a matrix \(X \in \mathbb{R}^{n \times d}\) (one row per token, each row a \(d\)-dimensional embedding), the layer learns three weight matrices \(W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}\) and projects the input through each of them to produce \(Q\), \(K\), and \(V\):

Figure: the six-token example “The cat sat on the mat”, one row per token.

\(Q \in \mathbb{R}^{6 \times 6}\)
$$Q = X \, W_Q$$
\(K \in \mathbb{R}^{6 \times 6}\)
$$K = X \, W_K$$
\(V \in \mathbb{R}^{6 \times 6}\)
$$V = X \, W_V$$
Each row of Q, K, and V corresponds to one input token: the same row index across all three matrices represents the same token.

\(Q\) (queries) is what the current token is “looking for”, \(K\) (keys) is what each token “advertises” about itself, and \(V\) (values) is the actual content that gets mixed into the output. Attention then combines them:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

That is the calculation for just one layer of a transformer, and LLMs have multiple layers. The output of one layer becomes the input \(X\) of the next, and each layer has to compute its own Q/K/V matrices.
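The projections and the attention formula above can be sketched in a few lines of plain Python. This is a toy, single-head version with no causal mask and made-up numbers; the helper names (`matmul`, `softmax`, `attention`) are illustrative, not taken from any library, and a real model would use a tensor framework.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(x, w_q, w_k, w_v):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]          # K transposed
    scores = matmul(q, k_t)                       # one row of scores per token
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy input: n = 3 tokens, d = d_k = 2, identity weights so Q = K = V = X.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out, weights = attention(x, identity, identity, identity)
```

Each row of `weights` sums to 1 and says how much of every token's value row is mixed into that token's output row.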
Figure: a 32-layer transformer, with data flowing from input to output through all 32 layers (only 3 layers shown; there are 29 more between L2 and L32). Each layer has its own attention block with its own K and V; the FFN sits between layers. Only K and V are cached. Cache cost grows as \(2 \times L \times n \times d_k\), where \(L\) is the number of layers and \(n\) the number of cached tokens; for a model like DeepSeek V4 Pro (61 layers) at long context, this is the dominant memory bottleneck of inference.

There is nothing to gain from saving the Q matrices, because the queries of earlier tokens are never needed again. The K and V rows, however, are reused by every later token, so we can keep them in a K/V cache and only compute K/V for the new tokens in the prompt.
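As a rough illustration of that idea, the sketch below keeps the K and V rows of one layer in a cache and only projects the rows of newly added tokens. The class `LayerKVCache` and the helper names are hypothetical, the weights are toy values, and only a single layer is modelled; in a real decoder the same reuse holds across layers because of causal masking.

```python
def project(rows, weight):
    """Project each token row through a weight matrix (row vector times matrix)."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*weight)]
            for row in rows]


def kv_cache_elements(layers, tokens, d_k):
    """Cache size in stored numbers: K and V, per layer, per token."""
    return 2 * layers * tokens * d_k


class LayerKVCache:
    """Hypothetical per-layer cache holding the K and V rows seen so far."""

    def __init__(self, w_k, w_v):
        self.w_k, self.w_v = w_k, w_v
        self.keys, self.values = [], []
        self.rows_projected = 0   # rows actually pushed through W_K and W_V

    def extend(self, new_rows):
        """Compute K/V for the new token rows only and append them."""
        self.keys += project(new_rows, self.w_k)
        self.values += project(new_rows, self.w_v)
        self.rows_projected += len(new_rows)
        return self.keys, self.values


w_k = [[1.0, 2.0], [0.0, 1.0]]
w_v = [[0.5, 0.0], [0.0, 0.5]]
prefix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # tokens already seen
new_tail = [[2.0, 1.0]]                          # tokens added by the follow-up

cache = LayerKVCache(w_k, w_v)
cache.extend(prefix)                      # first turn: every row is computed
keys, values = cache.extend(new_tail)     # follow-up: only the tail is computed

# The cached result matches a full recomputation over prefix + tail.
assert keys == project(prefix + new_tail, w_k)
assert values == project(prefix + new_tail, w_v)
```

Here the cache projects 4 rows in total across both turns, whereas recomputing from scratch on each turn would project 3 + 4 = 7. The gap widens with every additional turn.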
Figure: K and V for one layer, with a cached prefix of 6 tokens and 2 new tokens.
Legend: from cache (no compute); computed now (\(W_K x\), \(W_V x\)). Cached rows skip both matmuls (\(W_K x\) and \(W_V x\)), and that saving applies per layer. For a 32-layer model with a 4k-token cached prefix, that's 256,000 matmuls skipped per generation step.

This is all great, but you might be wondering: why cache these values? Isn’t it just better to compute them?

Well, no. Compute is expensive these days, especially when you need to do \(O(n^2)\) matrix multiplications. For example, recomputing the KV cache for one layer in Llama 70B with, let’s say, 20,000 input tokens is a 20,000 × 8,192 matrix (the input \(X\)) multiplied by an 8,192 × 1,024 weight matrix (\(W_K\); Llama 70B uses Grouped Query Attention, which shrinks \(d_{kv}\) from 8,192 down to 1,024). At two floating-point operations per multiply-add, that single matmul is 2 × 20,000 × 8,192 × 1,024, or about 336 GFLOPs for K and another 336 for V: call it ~672 GFLOPs per layer.

Stack 80 layers and you’re at roughly 54 TFLOPs of compute just to populate the KV state for one forward pass. On an H100 at peak BF16 (~990 TFLOPs/s), that’s around 55 ms. Doable once. But imagine doing that for each iteration of your chat. Let’s say you’re adding about 500 tokens each turn; that means you’ve burned about 27 seconds of H100 time on K/V projections alone,...