Saving Money on Inference



Merrilin.ai Blog


The repeated-prefix problem

A normal chat asks one question and gets one answer. A reading assistant does something more expensive: it keeps dragging the same context forward. Every follow-up needs the system prompt, book metadata, retrieved passages, reader state, previous answers, and whatever the user is referring to with words like “that” or “they”.

That means the expensive part of an agentic conversation is often not the new question. It is paying for the model to reread the same prefix over and over.

We care about this because Merrilin is still a passion project we are funding ourselves. That does not mean we want to make the product timid or ration the parts that make it useful. It means the opposite: if we can stop wasting money on repeated context, we can afford more reading sessions, more experiments, and better models where they actually matter.

[Interactive figure: "The problem is the prefix, not the follow-up". Each row lines up the chat turn with what the provider has to process, showing why a tiny follow-up can still carry a large hidden cost. Legend: fresh input without cache, cached-read prefix, new tail.]
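To put rough numbers on the repeated-prefix cost, here is a minimal sketch that counts input tokens over a hypothetical session. The function name, the 12,000-token prefix, and the per-turn tail sizes are illustrative assumptions, not Merrilin measurements. It also assumes the whole earlier prompt can be reused once it has been processed.

```python
def count_input_tokens(prefix_tokens, tail_tokens_per_turn):
    """Count input tokens over a conversation, with and without prefix reuse.

    prefix_tokens:        shared context present before the first question
    tail_tokens_per_turn: new tokens appended on each turn (hypothetical sizes)

    Returns (reprocessed, fresh, reused):
      reprocessed - input tokens if every turn rereads the full prompt
      fresh       - input tokens that are new on the turn they first appear
      reused      - input tokens that were already processed on an earlier turn
    """
    prompt = prefix_tokens
    already_seen = 0
    reprocessed = fresh = reused = 0
    for tail in tail_tokens_per_turn:
        prompt += tail                    # the prompt grows by the new tail
        reprocessed += prompt             # no reuse: pay for everything again
        fresh += prompt - already_seen    # with reuse: only unseen tokens are fresh
        reused += already_seen            # the rest is the repeated prefix
        already_seen = prompt
    return reprocessed, fresh, reused


# Hypothetical session: 12,000 tokens of book context, then four short follow-ups.
reprocessed, fresh, reused = count_input_tokens(12_000, [300, 250, 400, 200])
print(reprocessed, fresh, reused)  # 50950 13150 37800
```

In this made-up session, roughly three quarters of the input tokens are the same prefix being read again.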

Prompt caching is the provider-level version of not recomputing that same prefix. The model still sees the full conversation, but the billing and compute path changes: repeated context becomes a cached read, and only the new tail of the prompt is treated as expensive fresh input.

What the model is caching

To see why that is possible, we need to look at what the model is actually caching.

Modern transformer architectures use what is known as attention to predict the next token based on a given input. Each layer of the transformer needs to compute three values, which are then used by the attention block.

For an input sequence packed into a matrix \(X \in \mathbb{R}^{n \times d}\) (one row per token, each row a \(d\)-dimensional embedding), the layer learns three weight matrices \(W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}\) and projects the input through each of them to produce \(Q\), \(K\), and \(V\):

$$Q = X \, W_Q$$

$$K = X \, W_K$$

$$V = X \, W_V$$

[Interactive figure: the six tokens "The cat sat on the mat" projected into \(Q, K, V \in \mathbb{R}^{6 \times 6}\); selecting a token highlights its row in each matrix.]

Each row of Q, K, and V corresponds to one input token: the same row index across all three matrices represents the same token.

\(Q\) (queries) is what the current token is “looking for”, \(K\) (keys) is what each token “advertises” about itself, and \(V\) (values) is the actual content that gets mixed into the output. Attention then combines them:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

That is the calculation for just one layer of a transformer, and LLMs have multiple layers. The output of one layer becomes the input \(X\) of the next, and each layer has to compute its own Q/K/V matrices.
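The projections and the attention formula above can be sketched in a few lines of plain Python. This is a toy, single-head version that follows the formula exactly as written (no causal mask, no batching). The helper names `matmul`, `softmax`, and `attention` are ours for illustration, and a real model would use a tensor library instead of nested lists.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(x, w_q, w_k, w_v):
    """softmax(Q K^T / sqrt(d_k)) V for one layer and one head."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]           # K transposed
    scores = matmul(q, k_t)                        # one row of scores per token
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)                      # mix the value rows


# Three toy tokens with d = d_k = 2 and identity weights (illustrative values).
identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(x, identity, identity, identity)
print(len(out), len(out[0]))  # 3 2
```

Each output row is a weighted average of the value rows, so the output has one row per input token, just like Q, K, and V.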

[Interactive figure: input → 32 transformer layers → output. Only 3 of the 32 layers are drawn; there are 29 more between L2 and L32. Each layer has its own attention block with its own K and V; the FFN sits between layers. Only K and V are cached. Cache cost grows as \(2 \times L \times n \times d_k\); for a model like DeepSeek V4 Pro (61 layers) at long context, this is the dominant memory bottleneck of inference.]

We do not keep the Q matrices from each layer: a new token only needs its own query, compared against the keys and values of the tokens before it. What we can save is each layer’s K and V (the K/V cache), so that K/V only has to be computed for the new tokens in the prompt.
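Here is a minimal sketch of that idea for a single layer, in plain Python. The class `LayerKVCache` and the helpers `project` and `kv_cache_values` are hypothetical names for illustration, not any provider's implementation. It relies on causal masking, which means a prefix token's K and V rows do not change when later tokens are appended.

```python
def project(rows, w):
    """Project each token row through weight matrix w (rows @ w)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*w)] for row in rows]


class LayerKVCache:
    """Holds one layer's K and V rows for the tokens seen so far."""

    def __init__(self, w_k, w_v):
        self.w_k, self.w_v = w_k, w_v
        self.k, self.v = [], []

    def extend(self, new_rows):
        """Compute K/V only for the new token rows and append them."""
        self.k.extend(project(new_rows, self.w_k))
        self.v.extend(project(new_rows, self.w_v))
        return len(new_rows)  # how many rows actually needed compute


def kv_cache_values(n_layers, n_tokens, d_k):
    """Number of cached values: 2 (K and V) x layers x tokens x d_k."""
    return 2 * n_layers * n_tokens * d_k


# Toy weights and token rows (illustrative values only).
w_k = [[1.0, 0.0], [0.5, 1.0]]
w_v = [[0.0, 1.0], [1.0, 0.0]]
prefix = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
tail = [[7.0, 8.0]]

cache = LayerKVCache(w_k, w_v)
cache.extend(prefix)             # first turn: the whole prefix is computed once
computed = cache.extend(tail)    # follow-up: only the new tail is computed

# The cached result matches recomputing K and V for the full prompt.
same_k = cache.k == project(prefix + tail, w_k)
same_v = cache.v == project(prefix + tail, w_v)
print(computed, same_k, same_v)  # 1 True True
```

The follow-up turn computes one new row instead of four, and the cached K and V are identical to a full recompute.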

[Interactive figure: a slider sets the cached prefix length (for example, 6 cached tokens and 2 new ones) for the K and V of one layer. Cached rows come from the cache with no compute; new rows are computed now as \(W_K \cdot x\) and \(W_V \cdot x\). Cached rows skip both matmuls, and that saving applies per layer. For a 32-layer model with a 4k-token cached prefix, that's 256,000 matmuls skipped per generation step.]

This is all great, but you might be wondering: why cache these values? Isn’t it just better to compute them?

Well, no. Compute is expensive these days, especially when you need to do expensive \(O(n^2)\) matrix multiplications. For example, recomputing the KV cache for one layer in Llama 70B with, let’s say, 20,000 input tokens is a 20,000 × 8,192 matrix (the input \(X\)) multiplied by an 8,192 × 1,024 weight matrix (\(W_K\); Llama 70B uses Grouped Query Attention, which shrinks \(d_{kv}\) from 8,192 down to 1,024). That single matmul is about 336 GFLOPs for K, another 336 for V: call it ~672 GFLOPs per layer.

Stack 80 layers and you’re at roughly 54 TFLOPs of compute just to populate the KV state for one forward pass. On an H100 at peak BF16 (~990 TFLOPs/s), that’s around 55 ms. Doable once. But imagine doing that for each iteration of your chat. Let’s say you’re adding about 500 tokens each turn; that means you’ve burned about 27 seconds of H100 time on K/V projections alone,...
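The per-matrix, per-layer, and per-pass figures above can be reproduced with a short calculation. This sketch uses the standard estimate of about \(2 \times n \times d \times d_{kv}\) FLOPs for an \((n \times d)\) by \((d \times d_{kv})\) matmul, and plugs in the Llama 70B shapes and the H100 peak figure quoted in the text. The function name is ours, and real hardware rarely sustains peak throughput, so treat the time as a lower bound.

```python
def kv_projection_flops(n_tokens, d_model, d_kv, n_layers):
    """FLOPs to compute K and V for n_tokens across all layers.

    An (n x d) @ (d x d_kv) matmul costs about 2 * n * d * d_kv FLOPs
    (one multiply and one add per term).
    """
    per_matrix = 2 * n_tokens * d_model * d_kv
    per_layer = 2 * per_matrix           # one matmul for K, one for V
    return per_matrix, per_layer, per_layer * n_layers


per_matrix, per_layer, total = kv_projection_flops(20_000, 8_192, 1_024, 80)
peak_flops_per_s = 990e12                # H100 peak BF16 figure from the text
seconds = total / peak_flops_per_s

print(f"{per_matrix / 1e9:.0f} GFLOPs per matrix")   # 336 GFLOPs per matrix
print(f"{per_layer / 1e9:.0f} GFLOPs per layer")     # 671 GFLOPs per layer
print(f"{total / 1e12:.1f} TFLOPs per pass")         # 53.7 TFLOPs per pass
print(f"{seconds * 1e3:.0f} ms at peak")             # 54 ms at peak
```

The unrounded results (about 671 GFLOPs per layer, 53.7 TFLOPs, and 54 ms) agree with the rounded numbers in the text.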
