Inference cost at scale with napkin math


Jun 14

Tags: gpu, ai, math

If you serve AI models as part of your product's stack (hard not to in 2026), you've likely done the math on how much juice you can get out of a single GPU. For monthly subscription products, this number directly affects pricing.

With some rudimentary knowledge of the hardware you're operating, you can make a ballpark estimate of how much each user costs you in dollars. For this article, we'll assume knowledge of the following:

GPU hardware specs: memory bandwidth and peak throughput (explained below).

Context length: assumed 200k tokens.

Active parameter count of the model: assumed 32B to keep things simple on a single GPU.

Some idea about your product: whether it's driven by user prompts or programmed loops, the duty cycle of your user profile (explained at the end), etc.

If you're already comfortable with the architecture of LLMs, use this outline to skip to the sections that interest you:

Resources on a single GPU

Cost of a Matrix Multiplication

An Overview of Language Models

Attention in Greater Detail

Reducing Compute with KV-Cache

How much does a token cost?

How many users can you serve realistically?

Optimizing for hundreds of users on a GPU

Tokens Per Second

Dollar cost per user

Resources on a single GPU

For any GPU on the market, you can find on its spec sheet:

Peak throughput: number of floating-point operations executed per second. Usually quoted in TFLOP/s (1 TFLOP/s = \(10^{12}\) ops/sec).

Memory bandwidth: amount of data that can be moved per second from global memory (VRAM) to on-chip memory (SRAM and registers). Usually in TB/sec.

We'll assume FP8 quantization (one byte per value) when computing throughput, though it's easy to adjust the math for FP16 as well.
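To make these two spec-sheet numbers concrete, here is a rough sketch that turns a FLOP count or a byte count into seconds. The spec values are placeholders made up for illustration (not taken from any real spec sheet), and `GpuSpec`, `compute_seconds`, and `memory_seconds` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSpec:
    peak_flops: float     # floating-point operations per second
    mem_bandwidth: float  # bytes per second, VRAM -> on-chip memory


# Placeholder numbers for illustration only; read the real ones
# off your own GPU's spec sheet.
EXAMPLE_GPU = GpuSpec(peak_flops=1000e12, mem_bandwidth=3e12)

BYTES_PER_PARAM_FP8 = 1  # FP8 stores each value in 8 bits


def compute_seconds(flops: float, gpu: GpuSpec) -> float:
    """Lower bound on time if the GPU ran at peak throughput."""
    return flops / gpu.peak_flops


def memory_seconds(num_bytes: float, gpu: GpuSpec) -> float:
    """Lower bound on time if the GPU ran at peak memory bandwidth."""
    return num_bytes / gpu.mem_bandwidth


# 32B active parameters at FP8: bytes moved to touch every weight once.
weight_bytes = 32e9 * BYTES_PER_PARAM_FP8
stream_ms = memory_seconds(weight_bytes, EXAMPLE_GPU) * 1e3
print(f"weights: {weight_bytes / 1e9:.0f} GB")
print(f"one pass over the weights: {stream_ms:.2f} ms")
```

Both helpers give best-case times; real kernels rarely hit either peak, which is fine for napkin math.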

Cost of a Matrix Multiplication

If you bothered to click on this article, you know that AI models do many matrix multiplications on massive matrices. So it should be no surprise that we start by finding the cost of a matmul.

Assume two matrices: \(A_{N \times d}\) and \(B_{d \times M}\). Let their product be the matrix \(O_{N \times M}\). From high school algebra, we know that each element of \(O\) can be computed as:

$$
O^{i,k} = \sum_{j=1}^{d} A^{i,j} \cdot B^{j,k}
$$

In this, we find our first insight into the "cost" of a matrix multiplication. For each \( O^{i,k}\), we need to start with an initial value of 0 and:

1. Load \(A^{i,j}\) from memory.

2. Load \(B^{j,k}\) from memory.

3. Multiply them together.

4. Add the result of #3 to the cumulative sum so far.

This is done a total of \(d\) times per output element, and there are \(NM\) output elements. So the cost of an \((N \times d)(d \times M)\) matrix product is \(2NMd\) memory accesses and \(2NMd\) floating-point operations.
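If you'd rather count than trust the algebra, the schoolbook loop below tallies loads and floating-point operations as it goes. It's a minimal sketch in plain Python (`naive_matmul_cost` is a hypothetical helper), and it counts only the loads of `A` and `B`, matching the steps listed above.

```python
def naive_matmul_cost(A, B):
    """Multiply A (N x d) by B (d x M) the schoolbook way, counting work."""
    N, d, M = len(A), len(B), len(B[0])
    loads = flops = 0
    O = [[0.0] * M for _ in range(N)]
    for i in range(N):
        for k in range(M):
            acc = 0.0
            for j in range(d):
                a, b = A[i][j], B[j][k]  # two loads from memory
                loads += 2
                acc += a * b             # one multiply and one add
                flops += 2
            O[i][k] = acc
    return O, loads, flops


N, d, M = 3, 4, 5
A = [[1.0] * d for _ in range(N)]
B = [[2.0] * M for _ in range(d)]
O, loads, flops = naive_matmul_cost(A, B)
assert loads == flops == 2 * N * M * d
```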

With an optimization called tiling, the memory accesses go down to about \( d(N+M) \), roughly one load per element of \(A\) and \(B\). The details aren't necessary to proceed, but Alvin's blog post has them for those curious.
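To see why tiling matters, compare how many FLOPs you get per element loaded in each case. This sketch just plugs numbers into the two formulas above; the function names and the 4096 sizes are arbitrary picks for illustration, and it ignores the cost of writing \(O\) back to memory.

```python
def matmul_flops(N, d, M):
    return 2 * N * M * d


def naive_accesses(N, d, M):
    # Two loads for every multiply-add.
    return 2 * N * M * d


def tiled_accesses(N, d, M):
    # Best case: each element of A (N*d) and B (d*M) is loaded about once.
    return d * (N + M)


N, d, M = 4096, 4096, 4096  # arbitrary sizes for illustration
for name, accesses in [("naive", naive_accesses), ("tiled", tiled_accesses)]:
    intensity = matmul_flops(N, d, M) / accesses(N, d, M)
    print(f"{name}: {intensity:.0f} FLOPs per element loaded")
```

The naive loop does one FLOP per load, so memory bandwidth is the limit long before peak throughput. With ideal tiling the ratio grows with the matrix size.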

An Overview of Language Models

At their core, LLMs are simple: they receive a sequence of N words and generate the (N+1)th. Each word is represented as a vector with d entries. Using repeated applications of a function called "attention" (explained later), they predict the next word.

A single forward pass roughly looks like this:

    y = input()  # y = matrix of size N x d, one row per input word
    for each layer in the network:
        y = attention(y)

    # Convert the final layer's output to word-probs.
    # W_vocab = matrix of size d x vocab_len,
    # and vocab_len is the number of all words
    # in the model's vocabulary.
    token_probs = softmax(y[-1] @ W_vocab)  # only the last position predicts the next word
    next_tok = argmax(token_probs)          # index of the most likely word
    # next_tok's embedding is a (1 x d) vector

This is also why LLMs are called auto-regressive: they keep running forward passes over their own output until a stop token is generated.

This is a simplified overview: I'm skipping RoPE, the MLP layers in between, token sampling at the end, and much more. As mentioned earlier, you can add those in and still verify that our math works out as a Fermi estimate.
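The auto-regressive loop itself is easy to sketch. The toy below uses random NumPy weights and skips the attention layers entirely, so its output is meaningless; the point is only the control flow. All sizes, the `EOS` id, and the `forward`/`generate` names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D, EOS, MAX_NEW = 16, 8, 0, 10  # toy sizes; the EOS id is an assumption

embed = rng.normal(size=(VOCAB, D))    # token id -> d-dimensional vector
W_vocab = rng.normal(size=(D, VOCAB))  # d-dimensional vector -> vocab scores


def forward(token_ids):
    """Stand-in for one forward pass: returns the next token id."""
    y = embed[token_ids]               # shape (N, D)
    # A real model would run its attention layers over y here.
    logits = y[-1] @ W_vocab           # scores from the last position only
    return int(np.argmax(logits))      # greedy pick, no sampling


def generate(prompt_ids):
    ids = list(prompt_ids)
    for _ in range(MAX_NEW):
        nxt = forward(ids)
        ids.append(nxt)                # the output becomes part of the next input
        if nxt == EOS:                 # stop token ends generation
            break
    return ids


out = generate([3, 7, 1])
print(out)
```

Every new token costs one full forward pass, which is the cost the rest of this article tries to estimate.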

Attention in Greater Detail

Let's place the attention function under a magnifying glass.

As you saw, the input is a matrix \(X \in \mathbb{R}^{N \times d}\), where each row \(X_i\) is a single \(d\)-dimensional vector. For every "layer" in the network, the model stores matrices \( W_Q, W_K, W_V \in \mathbb{R}^{d \times d} \), and computes "attention" as follows:

\( Q = X W_Q \), \( K = X W_K \) and \( V = X W_V \)

\( \mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left(Q K^T / \sqrt{d}\right) V \)

Or, in python:

    # Assumes NumPy arrays, with softmax taken from SciPy.
    from math import sqrt
    from scipy.special import softmax

    def attention(X, W_q, W_k, W_v):
        d = X.shape[-1]  # width of each word vector
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        Q_KT = Q @ K.transpose(1, 0)
        return softmax(Q_KT / sqrt(d), axis=-1) @ V

Where @ is the matrix product.

In reality, multiple LLM conversations are processed in parallel. So inference is batched: we process \(B\) chats concurrently. This means our input is now \( X \in \mathbb{R}^{B \times N \times d}\).

Work the math out on paper to verify it tracks.

In our Python code, just the transpose arguments change:

    - Q_KT = Q @ K.transpose(1, 0)
    + Q_KT = Q @ K.transpose(0,...
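If you want to check the shapes without pen and paper, here is a self-contained NumPy sketch that accepts both the single-chat and the batched input. Swapping the last two axes with `np.swapaxes` is my choice for handling both cases in one function, not necessarily how the snippet above does it. Like the formula, it has no causal mask and only one attention head.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(X, W_q, W_k, W_v):
    """X: (N, d) or (B, N, d); each weight matrix: (d, d)."""
    d = X.shape[-1]
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Swapping the last two axes works for 2-D and 3-D inputs alike.
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d)  # (..., N, N)
    return softmax(scores, axis=-1) @ V               # (..., N, d)


rng = np.random.default_rng(0)
B, N, d = 2, 5, 8  # toy sizes for illustration
X = rng.normal(size=(B, N, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

out = attention(X, W_q, W_k, W_v)
assert out.shape == (B, N, d)
# Each chat in the batch is processed independently of the others.
assert np.allclose(out[0], attention(X[0], W_q, W_k, W_v))
```

The \(N \times N\) score matrix per chat is the part that grows quadratically with context length.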
