Inference cost at scale with napkin math
Jun 14
gpu<br>ai<br>math
If you serve AI models as part of your product's stack (hard not to in 2026),<br>you've likely done the math on how much juice you can get out of a single GPU.<br>For monthly subscription products, this metric directly affects the pricing.
With some rudimentary knowledge about the hardware you're operating,<br>you can do a ballpark estimation of how much each user costs you in dollars.<br>For this article, we'll assume knowledge of the following:
GPU hardware specs: memory bandwidth and peak throughput (explained below).
Context length: assumed to be 200k tokens.
Active parameter count of the model: assumed to be 32B, to keep things simple on a single GPU.
Some idea about your product: whether it's driven by user prompts or programmed loops, the duty cycle of your user profile (explained at the end), etc.
If you're comfortable/familiar with the architecture of LLMs, use this legend to skip to the sections that interest you:
Resources on a single GPU
Cost of a Matrix Multiplication
An Overview of Language Models
Attention in Greater Detail
Reducing Compute with KV-Cache
How much does a token cost?
How many users can you serve realistically?
Optimizing for hundreds of users on a GPU
Tokens Per Second
Dollar cost per user
Resources on a single GPU
For any GPU on the market, you can find on its spec sheet:
Peak throughput: Number of floating-point operations executed per second. Usually in TFLOP/s<br>(1 TFLOP/s = \(10^{12}\) ops/sec).
Memory bandwidth: Amount of data that can be moved per second from global memory (VRAM) to on-chip memory (SRAM). Usually in TB/sec.
We'll assume FP-8 quantization to compute throughput,<br>though it's easy to adjust the math for FP-16 as well.
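To make these two numbers concrete, here is a minimal sketch of how they bound the runtime of any operation. The `GpuSpec` class and the numbers in it are hypothetical placeholders, not values from any real spec sheet, so substitute your own GPU's figures. It also assumes compute and memory transfers overlap perfectly, so the slower of the two sets the time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSpec:
    """Hypothetical container for the two spec-sheet numbers used here."""
    peak_flops: float     # floating-point operations per second
    mem_bandwidth: float  # bytes per second out of VRAM


def lower_bound_seconds(gpu: GpuSpec, flops: float, bytes_moved: float) -> float:
    """Best-case runtime, assuming compute and memory traffic fully overlap."""
    compute_time = flops / gpu.peak_flops
    memory_time = bytes_moved / gpu.mem_bandwidth
    return max(compute_time, memory_time)


# Placeholder values for illustration only: 1000 TFLOP/s and 2 TB/s.
example_gpu = GpuSpec(peak_flops=1000e12, mem_bandwidth=2e12)

# Illustration: read 32 GB of FP8 weights once (32B parameters at 1 byte each)
# while doing 64 GFLOPs of work (about 2 FLOPs per parameter).
t = lower_bound_seconds(example_gpu, flops=64e9, bytes_moved=32e9)
print(f"{t * 1e3:.1f} ms")
```

With these placeholder numbers the memory term dominates the compute term by a wide margin, which is the kind of imbalance this sort of estimate is meant to expose.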
Cost of a Matrix Multiplication
If you bothered to click on this article<br>you know that AI models do many matrix multiplications on massive matrices.<br>That we start by finding the cost of a matmul should be no surprise then.
Assume two matrices: \(A_{N \times d} \) and \(B_{d \times M}\).<br>Let their product be the matrix \( O_{N \times M} \).<br>From high school algebra, we know that each element of \(O\) can be computed as:
$$<br>O^{i,k} = \sum_{j=1}^{d} A^{i,j} * {B}^{j,k}<br>$$
In this, we find our first insight into the "cost" of a matrix multiplication. For each \( O^{i,k}\), we need to start with an initial value of 0 and:
Load \(A^{i,j}\) from memory.
Load \(B^{j,k}\) from memory.
Multiply them together.
Add the product from the previous step to the cumulative sum so far.
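This is done \(d\) times for each of the \(NM\) elements of \(O\), with two loads and two floating-point operations (one multiply, one add) each time.<br>So the cost of an \((N \times d)(d \times M)\) matrix product<br>is \( 2NMd \) memory accesses and \(2NMd\) floating-point operations.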
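The count above can be checked with a small sketch: a plain triple-loop matmul in pure Python that tallies every load and every floating-point operation. The function name and the tiny matrix sizes are illustrative choices, not part of any library.

```python
def naive_matmul_with_counts(A, B):
    """Multiply A (N x d) by B (d x M), counting loads and FLOPs."""
    N, d, M = len(A), len(B), len(B[0])
    loads = flops = 0
    O = [[0.0] * M for _ in range(N)]
    for i in range(N):
        for k in range(M):
            acc = 0.0  # initial value of the running sum
            for j in range(d):
                a, b = A[i][j], B[j][k]  # two loads from memory
                loads += 2
                acc += a * b             # one multiply and one add
                flops += 2
            O[i][k] = acc
    return O, loads, flops


N, d, M = 3, 4, 5
A = [[1.0] * d for _ in range(N)]
B = [[1.0] * M for _ in range(d)]
O, loads, flops = naive_matmul_with_counts(A, B)
assert loads == flops == 2 * N * M * d
```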
With an optimization called tiling,<br>the memory access goes down to about \( d(N+M) \).<br>The details aren't necessary to proceed, but Alvin's blog post<br>has them for those curious.
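To get a feel for how much tiling helps, the sketch below compares the two access counts. It takes the \( d(N+M) \) figure at face value, which amounts to assuming ideal tiling where each input element is read from global memory about once. The function names and the matrix sizes are illustrative.

```python
def naive_accesses(N: int, d: int, M: int) -> int:
    """Memory accesses for the untiled triple loop: two loads per multiply-add."""
    return 2 * N * M * d


def tiled_accesses(N: int, d: int, M: int) -> int:
    """Approximate accesses with ideal tiling: every element of A and B read once."""
    return d * (N + M)


N, d, M = 4096, 4096, 4096
ratio = naive_accesses(N, d, M) / tiled_accesses(N, d, M)
print(f"tiling cuts memory traffic by about {ratio:.0f}x")
```

The ratio simplifies to \( 2NM / (N+M) \), so the saving grows with the size of the matrices. The FLOP count is unchanged by tiling.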
An Overview of Language Models
At their core, LLMs are simple –<br>they receive a sequence of N words and generate the N+1th.<br>Each word is represented as a vector with d entries.<br>Using repeated applications of a function called "attention" (explained later), they predict the next word.
A single forward pass roughly looks like this:
y = input()  # y = matrix of size N x d<br>for each layer in the network:<br>    y = attention(y)
# Convert the final layer's output to word-probs.<br># W_vocab = matrix of size d x vocab_len,<br># and vocab_len is the number of all words<br># in the model's vocabulary.<br>token_probs = softmax(y[-1] @ W_vocab)  # only the last position predicts the next word<br>next_tok = embed(argmax(token_probs))  # embed() stands in for the word-to-vector lookup<br># next_tok is a (1 x d) vector
This is also why LLMs are called auto-regressive: they keep running forward passes over their own output, one new token per pass, until a stop token is generated.
This is a simplified overview: I'm skipping RoPE,<br>the MLP layers in between, token sampling at the end,<br>and much more.<br>As mentioned earlier, we're after a ballpark figure:<br>you can add those in and verify that the math still works out<br>to within a Fermi estimate.
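The auto-regressive loop itself can be sketched in a few lines. This is a toy under stated assumptions: `generate` and `toy_forward` are hypothetical names, tokens are plain integers rather than vectors, and the "model" is a stand-in that returns the last token plus one. The point is only the control flow: one forward pass per new token, each over the whole sequence so far.

```python
from typing import Callable, List


def generate(forward: Callable[[List[int]], int], prompt: List[int],
             stop_token: int, max_new_tokens: int) -> List[int]:
    """Repeatedly feed the sequence back into the model, one token per pass."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_tok = forward(tokens)  # one forward pass over everything so far
        tokens.append(next_tok)
        if next_tok == stop_token:
            break
    return tokens


def toy_forward(tokens: List[int]) -> int:
    """Stand-in for a real model: predicts the last token plus one."""
    return tokens[-1] + 1


out = generate(toy_forward, prompt=[1, 2, 3], stop_token=6, max_new_tokens=10)
print(out)  # [1, 2, 3, 4, 5, 6]
```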
Attention in Greater Detail
Let's place the attention function under a magnifying glass.
As you saw, the input is a matrix \(X \in \mathbb{R}^{N \times d}\), where each row \(X_i\) is a single \(d\)-dimensional vector.<br>For every "layer" in the network, the model stores matrices \( W_Q, W_K, W_V \in \mathbb{R}^{d \times d} \), and computes "attention" as follows:
\( Q = X W_Q \), \( K = X W_K \) and \( V = X W_V \)
\( \mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left(Q K^T / \sqrt{d}\right) V \)
Or, in python:
def attention(X, W_q, W_k, W_v):<br>    d_model = X.shape[-1]  # width of each token vector<br>    Q, K, V = X @ W_q, X @ W_k, X @ W_v<br>    Q_KT = Q @ K.transpose(2, 1)<br>    return softmax(Q_KT / sqrt(d_model)) @ V
Here, @ is Python's matrix-multiplication operator.
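Applying the \(2NMd\) rule from the matmul section to each product in the attention function gives a rough per-layer FLOP budget. The sketch below is a single-head estimate that ignores the softmax and the scaling. The function names and the values of `N` and `d` are illustrative and smaller than the ones assumed in this article.

```python
def matmul_flops(n: int, d: int, m: int) -> int:
    """FLOPs for an (n x d) @ (d x m) product: one multiply and one add per term."""
    return 2 * n * m * d


def attention_flops(N: int, d: int) -> dict:
    """Per-layer FLOPs for single-head attention over N tokens of width d."""
    return {
        "qkv_projections": 3 * matmul_flops(N, d, d),  # X @ W_Q, X @ W_K, X @ W_V
        "scores": matmul_flops(N, d, N),               # Q @ K^T gives (N x N)
        "weighted_sum": matmul_flops(N, N, d),         # softmax(...) @ V gives (N x d)
    }


costs = attention_flops(N=1024, d=512)
total = sum(costs.values())
for name, flops in costs.items():
    print(f"{name:16s} {flops / 1e9:6.2f} GFLOPs")
```

The projection cost grows linearly with \(N\), while the two products involving the \(N \times N\) score matrix grow quadratically, so they dominate at long context lengths.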
In reality, multiple LLM conversations are processed in parallel.<br>So inference is batched: we process \(B\) chats concurrently.<br>This means our input is now a tensor \( X \in \mathbb{R}^{B \times N \times d}\).
Work the math out on paper to verify it tracks.
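If paper isn't handy, a small shape-tracking sketch does the same check. `matmul_shape` is a hypothetical helper that works on shape tuples only (no actual tensors), and the values of `B`, `N` and `d` are arbitrary.

```python
def matmul_shape(a: tuple, b: tuple) -> tuple:
    """Result shape of a @ b, where b is either 2-D (shared weights) or batched."""
    *batch_a, n, k = a
    *batch_b, k2, m = b
    if k != k2:
        raise ValueError(f"inner dimensions differ: {k} vs {k2}")
    if batch_b and batch_a != batch_b:
        raise ValueError("batch dimensions differ")
    return (*batch_a, n, m)


B, N, d = 8, 128, 64
X = (B, N, d)
W = (d, d)                     # the same weights are shared by every chat
Q = K = V = matmul_shape(X, W)  # (B, N, d)
K_T = (B, d, N)                # K with its last two axes swapped
scores = matmul_shape(Q, K_T)  # (B, N, N): one score matrix per chat
out = matmul_shape(scores, V)  # (B, N, d): same shape as the input
print(scores, out)
```

The batch axis rides along untouched, so only the last two axes of \(K\) need to be swapped.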
In our Python code, just the transpose arguments change:
- Q_KT = Q @ K.transpose(2, 1)<br>+ Q_KT = Q @ K.transpose(0,...