Prompt processing vs. generation: two phases, opposite bottlenecks

Prompt Processing vs Generation: Why Your Box Is Fast at One and Slow at the Other


Here's a result that confuses almost everyone comparing local-LLM hardware: two machines can generate tokens at nearly identical speed, yet one takes three times longer to start replying to a long prompt. Same models, same quant, wildly different feel. People conclude the benchmarks are broken. They're not. They're measuring two different things.

Running a local LLM happens in two phases, and they have opposite bottlenecks. Once you understand the split, the entire local-hardware market stops being confusing: you'll know which spec actually matters for your workload, and why a Mac, a Strix Halo box, and a DGX Spark each win and lose at different things.

The two phases

Every request is processed in two distinct stages:

Prompt processing (a.k.a. "prefill" / "reading"). Before the model writes anything, it has to read your entire input (system prompt, document, chat history) and build the keys and values for it. This is the wait before the first word appears: time to first token (TTFT).

Generation (a.k.a. "decode" / "writing"). Then the model produces the reply one token at a time, each step depending on the last. This is the tokens-per-second you watch stream out.

They feel similar from the outside (both are "the model working"), but under the hood they stress completely different parts of your hardware.

Why generation is limited by memory bandwidth (not compute)

This is the single most important idea for buying local-LLM hardware, so here's the intuition. To generate one token, the model must pass that token through every layer, which means reading the relevant model weights out of memory. Then to generate the next token, it reads them all over again. Decode is one token at a time, so the weights get streamed from memory on every single step. That makes generation memory-bandwidth-bound: the bottleneck isn't how fast your chip can do math, it's how fast it can move weights out of memory.
The foundational analysis of transformer inference (Pope et al., "Efficiently Scaling Transformer Inference," 2022) makes this precise; the practical upshot is a back-of-envelope rule:

max tokens/sec ≈ memory bandwidth ÷ bytes read per token

Plug in a 70B model quantized to 4-bit (~40 GB of weights to stream per token):

| Memory bandwidth | Example hardware class | Approx. ceiling on a 70B Q4 |
|---|---|---|
| ~800 GB/s | Mac Studio Ultra / high-end GPU | ~20 tok/s |
| ~256 GB/s | Strix Halo / DGX Spark unified memory | ~6 tok/s |
| ~1000 GB/s | RTX 4090-class GDDR | ~25 tok/s |
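The ceilings in the table follow directly from the rule of thumb, and the arithmetic is easy to reproduce. Here is a minimal Python sketch; the function name `decode_ceiling_tok_per_s` is made up for illustration, and it assumes the full ~40 GB of weights is streamed once per token, ignoring KV-cache reads and any overhead. Real numbers will land below these bounds.

```python
def decode_ceiling_tok_per_s(bandwidth_gb_s: float, bytes_per_token_gb: float) -> float:
    """Upper bound on single-user decode speed when memory bandwidth is the limit.

    Implements: max tokens/sec ~= memory bandwidth / bytes read per token.
    """
    if bytes_per_token_gb <= 0:
        raise ValueError("bytes_per_token_gb must be positive")
    return bandwidth_gb_s / bytes_per_token_gb


# ~70B model at 4-bit: roughly 40 GB of weights streamed per token (figure from the text).
WEIGHTS_GB = 40.0

# Approximate bandwidth figures taken from the table above.
HARDWARE_GB_S = {
    "Mac Studio Ultra / high-end GPU": 800.0,
    "Strix Halo / DGX Spark unified memory": 256.0,
    "RTX 4090-class GDDR": 1000.0,
}

for name, bandwidth in HARDWARE_GB_S.items():
    ceiling = decode_ceiling_tok_per_s(bandwidth, WEIGHTS_GB)
    print(f"{name}: ~{ceiling:.1f} tok/s ceiling")
```

Swapping `WEIGHTS_GB` for a smaller model or a more aggressive quant shows how quickly the ceiling moves, which is the whole point of the rule: only the bytes read per token and the bandwidth appear in it.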

Notice what's missing from that table: raw compute (TFLOPS). For single-user generation it barely matters: you could double the chip's math throughput and the tokens-per-second would hardly move, because the chip is sitting idle waiting on memory. This is why memory bandwidth is the headline spec for local generation, and why Apple Silicon and unified-memory boxes, which pair big memory with high bandwidth, punch so far above their raw-compute weight. (It's also the deep reason Mixture-of-Experts models generate so fast: only the active parameters get read per token, so "bytes read per token" shrinks dramatically.)

Why prompt processing is the opposite: compute-bound

Prefill flips the equation. Instead of one token at a time, the model processes all your prompt tokens in parallel: a big matrix-times-matrix multiply. That keeps the math units saturated, so prompt processing is compute-bound: now the FLOPS and tensor cores you ignored for generation are exactly what determine your time to first token.

This is why a chip with strong tensor compute (like the GPU inside a DGX Spark) can produce a much faster first token on a long prompt, while a high-bandwidth-but-modest-compute box (a Mac, a Strix Halo) has to grind through it. And it gets worse with context length: prefill cost grows with prompt size, so the gap widens the longer your input.

The distinction is so fundamental that researchers showed prefill and decode interfere when mixed (Agrawal et al., SARATHI, 2023), and datacenters now literally run the two phases on different machines with different hardware (Patel et al., Splitwise, 2023). Two phases, two bottlenecks, two ideal chips.

How to read a local-LLM benchmark

Once you know there are two phases, benchmark numbers stop being noise. Nearly every serious local-LLM benchmark (the llama-bench output people post on r/LocalLLaMA, for instance) reports two figures: pp (prompt processing, often written "pp512") and tg (token generation, "tg128"). The pp number is your prefill/compute speed; the tg number is your decode/bandwidth speed. A box can post a huge pp and a modest tg (compute-rich, bandwidth-limited) or the reverse (a Mac: middling pp, healthy tg). So when someone says a machine "does 40 tok/s," always ask which number: a single figure hides exactly the trade-off that decides whether it fits your workload. The honest comparisons report both, at a stated...
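One way to see why a single tok/s figure hides the trade-off is to turn a pp/tg pair into a rough end-to-end estimate. The sketch below is illustrative only: the function name and the two machines' numbers are hypothetical, and it treats pp and tg as constant rates, which is a simplification because prefill cost grows with prompt size.

```python
def response_time_s(prompt_tokens: int, output_tokens: int,
                    pp_tok_s: float, tg_tok_s: float) -> tuple[float, float]:
    """Return (time to first token, total time) in seconds.

    Assumes constant rates: prefill runs at pp_tok_s, decode at tg_tok_s.
    """
    if pp_tok_s <= 0 or tg_tok_s <= 0:
        raise ValueError("rates must be positive")
    ttft = prompt_tokens / pp_tok_s          # compute-bound phase
    decode = output_tokens / tg_tok_s        # bandwidth-bound phase
    return ttft, ttft + decode


# Hypothetical machines: identical tg, but B has 3x the pp of A.
machine_a = {"pp": 300.0, "tg": 20.0}
machine_b = {"pp": 900.0, "tg": 20.0}

# A long prompt (9000 tokens) with a 400-token reply.
for label, m in (("A", machine_a), ("B", machine_b)):
    ttft, total = response_time_s(9000, 400, m["pp"], m["tg"])
    print(f"machine {label}: TTFT {ttft:.0f} s, total {total:.0f} s")
```

With these made-up figures both machines stream the reply at the same speed, yet A waits three times longer before the first token appears. That is the pattern described at the top of the article, and it only shows up when pp and tg are reported separately.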
