Inference cost at scale with napkin math

Jun 14

gpu ai math

If you serve AI models as part of your product's stack (hard not to in 2026), you've likely done the math on how much juice you can get out of a single GPU. For monthly subscription products, this number directly affects pricing.

With some rudimentary knowledge about the hardware you're operating, you can do a ballpark estimation of how much each user costs you in dollars. For this article, we'll assume knowledge of the following:

GPU hardware specs: Memory bandwidth and peak throughput (explained below).

Context length: Assumed 200k tokens.

Active parameter count of the model: Assumed 32B to keep things simple on a single GPU.

Some idea about your product: Whether it's driven by user prompts or programmed loops, the duty cycle of your user profile (explained at the end), etc.

If you're comfortable/familiar with the architecture of LLMs, use this legend to skip to the sections that interest you:

Resources on a single GPU

Cost of a Matrix Multiplication

An Overview of Language Models

Attention in Greater Detail

Reducing Compute with KV-Cache

How much does a token cost?

How many users can you serve realistically?

Optimizing for hundreds of users on a GPU

Tokens Per Second

Dollar cost per user

Resources on a single GPU

For any GPU on the market, you can find on its spec sheet:

Peak throughput: Number of floating-point operations executed per second. Usually quoted in TFLOP/s (1 TFLOP/s = $10^{12}$ ops/sec).

Memory bandwidth: Amount of data that can be moved per second from global memory (VRAM) to registers (SRAM). Usually quoted in TB/sec.

We'll assume FP-8 quantization to compute throughput, though it's easy to adjust the math for FP-16 as well.
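To make the two spec-sheet numbers concrete, here is a minimal sketch that turns them into time. The hardware figures are hypothetical placeholders rather than any real GPU's spec sheet, and the function names are made up for illustration. Only the 32B parameter count comes from the assumptions above.

```python
# NOTE: the two hardware numbers are hypothetical placeholders, not a real spec sheet.
PEAK_FLOPS = 2000e12      # assumed peak throughput, in FLOP/s
MEM_BANDWIDTH = 3e12      # assumed memory bandwidth, in bytes/s
ACTIVE_PARAMS = 32e9      # active parameter count assumed in this article

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp8": 1, "fp16": 2}


def weight_bytes(params, dtype):
    """Bytes occupied by the weights at the given precision."""
    return params * BYTES_PER_PARAM[dtype]


def seconds_to_stream(num_bytes, bandwidth=MEM_BANDWIDTH):
    """Time to move num_bytes once at full memory bandwidth."""
    return num_bytes / bandwidth


def seconds_to_compute(num_flops, peak_flops=PEAK_FLOPS):
    """Time to execute num_flops at peak throughput."""
    return num_flops / peak_flops


for dtype in ("fp8", "fp16"):
    size = weight_bytes(ACTIVE_PARAMS, dtype)
    ms = seconds_to_stream(size) * 1e3
    print(f"{dtype}: {size / 1e9:.0f} GB of weights, {ms:.1f} ms to read them once")
```

Switching from FP-8 to FP-16 doubles the bytes per parameter, so the time to read the weights doubles too. Swap in your own GPU's numbers before trusting any output.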

Cost of a Matrix Multiplication

If you bothered to click on this article, you know that AI models do many matrix multiplications on massive matrices. That we start by finding the cost of a matmul should be no surprise then.

Assume two matrices: $A_{N \times d} $ and $B_{d \times M}$. Let their product be the matrix $ O_{N \times M} $. From high school algebra, we know that each element of $O$ can be computed as:

$$ O^{i,k} = \sum_{j=1}^{d} A^{i,j} \cdot B^{j,k} $$

In this, we find our first insight into the "cost" of a matrix multiplication. For each $ O^{i,k}$, we need to start with an initial value of 0 and:

1. Load $A^{i,j}$ from memory.

2. Load $B^{j,k}$ from memory.

3. Multiply them together.

4. Add the result of step 3 to the cumulative sum so far.

This is done $d$ times per output element, and there are $NM$ output elements. So the cost of an $(N \times d) \cdot (d \times M)$ matrix product is $2NMd$ memory accesses and $2NMd$ floating-point operations.
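The count can be checked with a schoolbook matmul that tallies its own work. This is a sketch for counting only, in pure Python with a hypothetical function name. It is nothing like how a GPU kernel is written.

```python
def naive_matmul_with_counts(A, B):
    """Multiply A (N x d) by B (d x M) the schoolbook way, counting loads and FLOPs."""
    N, d, M = len(A), len(B), len(B[0])
    loads = 0
    flops = 0
    O = [[0.0] * M for _ in range(N)]
    for i in range(N):
        for k in range(M):
            total = 0.0              # start each output element at 0
            for j in range(d):
                a = A[i][j]          # load A[i][j]
                b = B[j][k]          # load B[j][k]
                loads += 2
                total += a * b       # one multiply and one add
                flops += 2
            O[i][k] = total
    return O, loads, flops


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]      # N = 2, d = 3
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # d = 3, M = 2
O, loads, flops = naive_matmul_with_counts(A, B)

N, d, M = 2, 3, 2
assert loads == 2 * N * M * d
assert flops == 2 * N * M * d
print(O, loads, flops)
```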

With an optimization called tiling, the memory access goes down to about $ d(N+M) $. The details aren't necessary to proceed, but Alvin's blog post has them for those curious.
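To get a feel for how much tiling saves, the sketch below plugs sizes into the two access counts from this section. The function names and the example sizes are hypothetical, and the tiled figure is the approximate $d(N+M)$ quoted above, not a measured number.

```python
def matmul_flops(N, d, M):
    """Floating-point operations for an (N x d) by (d x M) product."""
    return 2 * N * M * d


def naive_accesses(N, d, M):
    """Memory accesses when every multiply reloads both operands."""
    return 2 * N * M * d


def tiled_accesses(N, d, M):
    """Approximate memory accesses with tiling, as quoted in the text."""
    return d * (N + M)


def flops_per_access(N, d, M, accesses):
    """How much arithmetic is done for each value fetched from memory."""
    return matmul_flops(N, d, M) / accesses(N, d, M)


# Hypothetical shapes: a large square product and a single-row product.
for N, d, M in [(4096, 4096, 4096), (1, 4096, 4096)]:
    saving = naive_accesses(N, d, M) / tiled_accesses(N, d, M)
    print(
        f"N={N}, d={d}, M={M}: tiling cuts accesses by {saving:.1f}x, "
        f"{flops_per_access(N, d, M, tiled_accesses):.1f} FLOPs per access"
    )
```

The FLOP count is identical in both cases. Only the traffic to memory changes, which is why the same multiplication can be limited by either throughput or bandwidth depending on its shape.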

An Overview of Language Models

At their core, LLMs are simple – they receive a sequence of N words and generate the N+1th. Each word is represented as a vector with d entries. Using repeated applications of a function called "attention" (explained later), they predict the next word.

A single forward pass roughly looks like this:

    y = input()  # y = matrix of size N x d
    for layer in network:
        y = attention(y)  # each layer applies its own weights

    # Convert the final layer's output to word-probs.
    # W_vocab = matrix of size d x vocab_len,
    # and vocab_len is the number of all words
    # in the model's vocabulary.
    token_probs = softmax(y @ W_vocab)
    next_id = argmax(token_probs[-1])  # most likely word after the last position
    # embedding = table of size vocab_len x d holding one vector per word
    next_tok = embedding[next_id]  # next_tok is a (1 x d) vector

This is also why LLMs are called auto-regressive. They keep doing forward passes over their own output, appending one token each time, until a stop token is generated.

This is a simplified overview in which I'm skipping RoPE, the MLP layers in between, token sampling at the end, and much more. You can add those in and still verify that our math holds up as a Fermi estimate.
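For readers who want to run the loop, here is a toy version in NumPy. It assumes random weights and tiny sizes, uses greedy argmax instead of sampling, and skips the causal mask along with everything else listed above. `STOP_ID`, `embedding`, `layers`, `forward`, and `generate` are hypothetical names chosen for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab_len, n_layers = 16, 50, 2   # toy sizes, for illustration only
STOP_ID = 0                          # hypothetical id of the stop token

embedding = rng.normal(size=(vocab_len, d))   # one d-vector per word
layers = [
    tuple(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))  # W_q, W_k, W_v
    for _ in range(n_layers)
]
W_vocab = rng.normal(size=(d, vocab_len))


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(X.shape[-1])) @ V


def forward(token_ids):
    """One forward pass: N token ids in, the id of the predicted next token out."""
    y = embedding[token_ids]                  # N x d
    for W_q, W_k, W_v in layers:
        y = attention(y, W_q, W_k, W_v)
    token_probs = softmax(y[-1] @ W_vocab)    # probabilities after the last position
    return int(np.argmax(token_probs))        # greedy pick, no sampling


def generate(prompt_ids, max_new_tokens=8):
    """Auto-regressive loop: feed the output back in until a stop token appears."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = forward(ids)
        ids.append(next_id)
        if next_id == STOP_ID:
            break
    return ids


out = generate([3, 7, 11])
print(out)
```

Note that every call to `forward` reprocesses the whole sequence from scratch, which is the cost the rest of the article sets out to measure.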

Attention in Greater Detail

Let's place the attention function under a magnifying glass.

As you saw, the input is a matrix $X \in \mathbb{R}^{N \times d}$, and $X_i$ is a single $d$ dimensional vector. For every "layer" in the network, the model stores matrices $ W_Q,W_K, W_V \in \mathbb{R}^{d \times d} $, and computes "attention" as follows:

$Q = X W_Q$, $K = X W_K$ and $V = X W_V$

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(Q K^T / \sqrt{d}) \, V$

Or, in Python:

    def attention(X, W_q, W_k, W_v):
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        Q_KT = Q @ K.transpose(1, 0)  # swap the two axes of K: (N x d) -> (d x N)
        return softmax(Q_KT / sqrt(d_model)) @ V  # d_model is d, the width of each word vector

Where @ is the matrix product.

In reality, multiple LLM conversations are processed in parallel, so inference is batched: we process $B$ chats concurrently. This means our input is $X \in \mathbb{R}^{B \times N \times d}$.

Work the math out on paper to verify it tracks.

In our Python code, just the transpose arguments change:

    - Q_KT = Q @ K.transpose(1, 0)
    + Q_KT = Q @ K.transpose(0,...
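If you'd rather check the batched shapes in code than on paper, here is one way to sketch it in NumPy. It is an independent illustration, not the article's exact code: it swaps the last two axes with `np.swapaxes`, and the FLOP counter (a hypothetical helper) applies the $2NMd$ rule to the matmuls only, ignoring the softmax and the scaling.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def batched_attention(X, W_q, W_k, W_v):
    """X has shape (B, N, d); each weight matrix has shape (d, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v                  # each (B, N, d)
    scores = Q @ np.swapaxes(K, -1, -2)                  # (B, N, N)
    return softmax(scores / np.sqrt(X.shape[-1])) @ V    # (B, N, d)


def attention_matmul_flops(B, N, d):
    """FLOPs from the matmuls only, using 2 * rows * cols * inner per product."""
    projections = 3 * (2 * N * d * d)   # X @ W_q, X @ W_k, X @ W_v
    scores = 2 * N * N * d              # Q @ K^T
    weighted = 2 * N * N * d            # softmax(...) @ V
    return B * (projections + scores + weighted)


rng = np.random.default_rng(0)
B, N, d = 2, 5, 8                       # toy sizes, for illustration only
X = rng.normal(size=(B, N, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

out = batched_attention(X, W_q, W_k, W_v)
assert out.shape == (B, N, d)

# Each chat in the batch is processed independently of the others.
single = batched_attention(X[:1], W_q, W_k, W_v)
assert np.allclose(out[0], single[0])

print(attention_matmul_flops(B, N, d))
```

The two $N^2 d$ terms are the ones that grow quadratically with context length, while the projection cost grows only linearly in $N$.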
