Saving Money on Inference


Merrilin.ai Blog


The repeated-prefix problem

A normal chat asks one question and gets one answer. A reading assistant does something more expensive: it keeps dragging the same context forward. Every follow-up needs the system prompt, book metadata, retrieved passages, reader state, previous answers, and whatever the user is referring to with words like “that” or “they”.

That means the expensive part of an agentic conversation is often not the new question. It is paying for the model to reread the same prefix over and over.

We care about this because Merrilin is still a passion project we are funding ourselves. That does not mean we want to make the product timid or ration the parts that make it useful. It means the opposite: if we can stop wasting money on repeated context, we can afford more reading sessions, more experiments, and better models where they actually matter.

Figure (interactive): The problem is the prefix, not the follow-up. Each row lines up the chat turn with what the provider has to process, showing why a tiny follow-up can still carry a large hidden cost. Legend: fresh input without cache, cached-read prefix, new tail.
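To put rough numbers on the repeated-prefix cost, here is a minimal sketch that compares billing every turn's full prompt as fresh input against billing the already-seen context at a discounted cached-read rate. The function name, the token counts, and both prices are hypothetical, and it ignores details such as cache-write surcharges or cache expiry that a real provider may apply.

```python
def input_cost(turn_tails, prefix_tokens, fresh_price, cached_price):
    """Return (cost_without_cache, cost_with_cache) for one conversation.

    turn_tails: new tokens added by each turn (hypothetical numbers).
    prefix_tokens: shared context sent on the first turn.
    fresh_price / cached_price: per-token prices, purely illustrative.
    """
    without_cache = 0.0
    with_cache = 0.0
    context = prefix_tokens  # tokens the provider has already seen
    for i, tail in enumerate(turn_tails):
        # Without caching, the whole prompt is billed as fresh input every turn.
        without_cache += (context + tail) * fresh_price
        if i == 0:
            # First turn: nothing is cached yet, so everything is fresh.
            with_cache += (context + tail) * fresh_price
        else:
            # Later turns: old context is a cached read, only the tail is fresh.
            with_cache += context * cached_price + tail * fresh_price
        context += tail
    return without_cache, with_cache
```

With a 10,000-token prefix, three turns of 100 new tokens each, and a cached read priced at a tenth of fresh input, `input_cost([100, 100, 100], 10_000, 10, 1)` returns `(306000.0, 123300.0)`: most of the uncached bill is the same prefix being paid for again.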

Prompt caching is the provider-level version of not recomputing that same prefix. The model still sees the full conversation, but the billing and compute path changes: repeated context becomes a cached read, and only the new tail of the prompt is treated as expensive fresh input.

What the model is caching

To see why that is possible, we need to look at what the model is actually caching. Modern transformer architectures use what is known as attention to predict the next token based on a given input. Each layer of the transformer needs to compute three values, which are then used by the attention block. For an input sequence packed into a matrix $X \in \mathbb{R}^{n \times d}$ (one row per token, each row a $d$-dimensional embedding), the layer learns three weight matrices $W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}$ and projects the input through each of them to produce $Q$, $K$, and $V$.

Figure (interactive): the six-token example “The cat sat on the mat”, with one row per token highlighted in each matrix.

$Q \in \mathbb{R}^{6 \times 6}$

$$Q = X \, W_Q$$

$K \in \mathbb{R}^{6 \times 6}$

$$K = X \, W_K$$

$V \in \mathbb{R}^{6 \times 6}$

$$V = X \, W_V$$

Each row of $Q$, $K$, and $V$ corresponds to one input token: the same row index across all three matrices represents the same token.

$Q$ (queries) is what the current token is “looking for”, $K$ (keys) is what each token “advertises” about itself, and $V$ (values) is the actual content that gets mixed into the output. Attention then combines them:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

That is the calculation for just one layer of a transformer, and LLMs have multiple layers. The output of one layer becomes the input $X$ of the next, and each layer has to compute its own Q/K/V matrices.
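As a concrete illustration of the projections and the attention formula, here is a minimal single-head sketch in plain Python. It follows the formula exactly as written, so it has no causal mask (decoder-style LLMs add one) and no multi-head split; the helper names `matmul`, `softmax`, and `attention` are our own for this example, not any library's API.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, w_q, w_k, w_v):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    # Project the input through the three learned weight matrices.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose K
    # Scaled scores: one row per query token, one column per key token.
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted mix of the value rows.
    return matmul(weights, v)
```

A quick sanity check: if `w_q` is all zeros, every score is zero, the softmax is uniform, and each output row is just the average of the rows of $V$.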

Figure (interactive): input → all 32 layers → output. A 32-layer transformer (only 3 layers shown; there are 29 more between L2 and L32). Each layer has its own attention block with its own K and V; the FFN sits between layers. Only K and V are cached.

Cache cost grows as $2 \times L \times n \times d_k$; for a model like DeepSeek V4 Pro (61 layers) at long context, this is the dominant memory bottleneck of inference.

While we can’t save the Q matrices from each layer, we can save the K/V cache and only compute K/V for the new tokens in the prompt.
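The $2 \times L \times n \times d_k$ growth is easy to turn into bytes. The sketch below does that under one assumption that is ours, not the post's: each cached value takes 2 bytes (a 16-bit format). The function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(layers, tokens, d_k, bytes_per_value=2):
    """Size of the KV cache in bytes.

    2 * layers * tokens * d_k values (the factor 2 is K plus V),
    times an assumed number of bytes per stored value.
    """
    return 2 * layers * tokens * d_k * bytes_per_value
```

Plugging in the Llama 70B figures used later in this post (80 layers, 1,024-wide K/V, 20,000 tokens) gives `kv_cache_bytes(80, 20_000, 1_024)`, which is 6,553,600,000 bytes, or about 6.5 GB for a single conversation at that assumed precision. The cost is linear in every factor, so doubling the context doubles the cache.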

Figure (interactive): the K and V matrices of one layer, split into 6 cached rows and 2 new rows.
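The split between cached rows and newly computed rows can be sketched for a single layer as follows. The class `LayerKVCache` and its `projected_rows` counter are hypothetical names for illustration. The sketch assumes the earlier tokens are unchanged between calls, which is what makes their cached K and V rows still valid.

```python
def project(rows, w):
    """Project each token row through a weight matrix (rows @ w)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*w)] for row in rows]

class LayerKVCache:
    """Hypothetical per-layer cache: one K row and one V row per token."""

    def __init__(self, w_k, w_v):
        self.w_k, self.w_v = w_k, w_v
        self.k, self.v = [], []
        self.projected_rows = 0  # rows that actually went through the matmuls

    def extend(self, x):
        """Take the full layer input x and project only the uncached rows."""
        new_rows = x[len(self.k):]  # everything past the cached prefix
        self.k += project(new_rows, self.w_k)
        self.v += project(new_rows, self.w_v)
        self.projected_rows += len(new_rows)
        return self.k, self.v
```

Calling `extend` with the first 6 tokens and then with all 8 projects 8 rows in total rather than 14, and the resulting K and V are identical to projecting all 8 rows from scratch.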

Cached rows are read from the cache with no compute; only the new rows are computed now ($W_K x$ and $W_V x$). Cached rows skip both matmuls, and that saving applies per layer. For a 32-layer model with a 4k-token cached prefix, that's 256,000 matmuls skipped per generation step.

This is all great, but you might be wondering: why cache these values? Isn’t it just better to compute them? Well, no. Compute is expensive these days, especially when you need to do expensive $O(n^2)$ matrix multiplications.

For example, recomputing the KV cache for one layer in Llama 70B with, let’s say, 20,000 input tokens means multiplying a 20,000 × 8,192 matrix (the input $X$) by an 8,192 × 1,024 weight matrix ($W_K$; Llama 70B uses Grouped Query Attention, which shrinks $d_{kv}$ from 8,192 down to 1,024). That single matmul is about 336 GFLOPs for K, another 336 for V: call it ~672 GFLOPs per layer. Stack 80 layers and you’re at roughly 54 TFLOPs of compute just to populate the KV state for one forward pass. On an H100 at peak BF16 (~990 TFLOPs/s), that’s around 55 ms.

Doable once. But imagine doing that for each iteration of your chat. Let’s say you’re adding about 500 tokens each turn; that means you’ve burned about 27 seconds of H100 time on K/V projections alone,...
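The back-of-the-envelope numbers for the 20,000-token Llama 70B example can be reproduced in a few lines. This is only a sketch of the arithmetic: it counts one multiply and one add per term of the matmul, uses the dimensions and the peak H100 figure quoted in the text, and assumes perfect utilisation, which real hardware does not reach. The helper name `matmul_flops` is our own.

```python
def matmul_flops(n, d_in, d_out):
    """FLOPs for an (n x d_in) @ (d_in x d_out) matmul: a multiply and an add per term."""
    return 2 * n * d_in * d_out

# Dimensions taken from the example in the text.
tokens, d_model, d_kv, layers = 20_000, 8_192, 1_024, 80

per_projection = matmul_flops(tokens, d_model, d_kv)  # K (or V) for one layer
per_layer = 2 * per_projection                        # K and V together
total = layers * per_layer                            # every layer in the model
h100_peak = 990e12                                    # FLOPs per second, peak BF16 as quoted
seconds = total / h100_peak                           # assumes 100% utilisation

print(f"{per_projection / 1e9:.0f} GFLOPs per projection")
print(f"{per_layer / 1e9:.0f} GFLOPs per layer")
print(f"{total / 1e12:.1f} TFLOPs across all layers")
print(f"{seconds * 1e3:.0f} ms at peak throughput")
```

Unrounded, this comes to about 671 GFLOPs per layer, 53.7 TFLOPs across 80 layers, and roughly 54 ms at peak, in line with the rounded figures above.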
