Inference Cards

Jun 25, 2026

Why

When someone says “I run Qwen 3.6 at 25 tokens per second”, or makes any similar performance claim about their self-hosted LLM setup, this is only meaningful if we know several other things.

Which model variant? Qwen 3.6 could be the dense 27B or the 35B-A3B MoE, which are totally different architectures. Better yet, just link to the repo you downloaded the weights from.

Which quantization? Q8, Q4_K_XL, and IQ3_XXS are at different points in quality/speed/size space.

What hardware and inference engine? Is this vLLM on an H100 or llama.cpp on a Raspberry Pi?

I could ask if you’re using speculative decoding or various weird stuff, but better yet, show the command you used to invoke the inference engine.

How did you measure the speed? Are you exercising an HTTP API (which adds a possibly-large chat template in the context window), or using something like llama-bench (which skips the template and HTTP/network delay)? Or better, what command did you use to run the test?

Also, knowing only the generation (i.e. decode) speed at a shallow context depth is not enough to understand whether agentic workloads will be usable on a given setup. Prefill (i.e. prompt processing) speed matters because agents spend a lot of time reading stuff. It also matters how speeds change as context depth increases, because agents do most of their work with tens of thousands of tokens in the context window. And if you’re trying to serve multi-agent (or multi-user) workloads, it matters how these numbers change with multiple concurrent requests. (And no, you cannot guesstimate any of these other numbers from “25 TPS generation speed”, because different hardware and inference engines all have different performance characteristics in this several-dimensional space.)
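To see why prefill speed can dominate, here is a minimal back-of-the-envelope sketch. The function name and every number in it are made up for illustration (they are not measurements), and it assumes none of the prompt is served from a cache.

```python
def request_seconds(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Estimate seconds spent in each stage of one request.

    Ignores network delay, queueing, and prompt caching (all assumptions).
    """
    prefill_s = prompt_tokens / prefill_tps  # time spent reading the prompt
    decode_s = output_tokens / decode_tps    # time spent generating the reply
    return prefill_s, decode_s


# Hypothetical agent step: 16,000 uncached prompt tokens in, 300 tokens out.
prefill_s, decode_s = request_seconds(16_000, 300, prefill_tps=600.0, decode_tps=25.0)
print(f"prefill {prefill_s:.1f} s, decode {decode_s:.1f} s")
```

With these illustrative numbers the machine spends roughly 27 seconds reading and 12 seconds writing, so a claim about decode speed alone would describe less than a third of the wait.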

With this fuller picture, we can more reasonably compare your computer to my computer. We can talk about which workloads are usable interactively and which will crawl at “run overnight” speed. We can spot when something is broken, and reasonably ask “Does this change make it faster?”, knowing what “it” even was to begin with. We also get a sense of what quality of output to expect from the LLM.

In online communities for self-hosted inference, most people don’t bother to communicate most of this information, and the quality of discussion suffers! We need a compact, easy way to share so that more people will do it.

Now I follow in some big footsteps to propose a deliberately under-specified plaintext markup format. I hope it is highly readable and easy for new people to pick up.

Inference Cards

Think of baseball cards, but for computers running LLMs. An inference card shows the most important information to understand setup and performance. You can share them in a code block, or as a screenshot if you hate searching / accessibility. Put inference cards in your pull requests, reddit posts, or wherever you talk about your LLM life.

Here is the world’s first inference card, for my own slop machine.

+----------------Inference Card v1-----------------+
| Who+when: cmart.blog, 2026-06-25                 |
| Weights repo: hf.co/unsloth/Qwen3.6-27B-GGUF     |
| Quantization: UD-Q4_K_XL                         |
| Platform: Thinkpad T480, Debian 13, eGPU dock    |
| Accelerator+mem: AMD Radeon AI Pro 9700, 32 GB   |
| Engine+ver: llama.cpp b9733                      |
| GPU runtime+ver: ROCm 7.2.4                      |
|----------------------Tok/s-----------------------|
| Concurrency  | Context depth                     |
| ↓  | Stage   | Empty  | 4096   | 16384  | 65536  |
|----|---------|--------|--------|--------|--------|
| 1  | prefill | 667    | 669    | 640    | 474    |
| 1  | decode  | 32.1   | 24.8   | 26.6   | 22.9   |
| 2  | prefill | 519    | 588    |        |        |
| 2  | decode  | 23.3   | 16.2   |        |        |
| 4  | prefill | 526    | 537    |        |        |
| 4  | decode  | 16.4   | 9.80   |        |        |
+------------------Config / Notes------------------+

Serving with:

./llama-server \
  --hf-repo unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
  --gpu-layers all \
  --spec-type draft-mtp \
  --spec-draft-n-max 4 \
  --chat-template-file ~/Qwen-Fixed-Chat-Templates/chat_template.jinja

Measuring with:

uv run llama-benchy \
  --base-url http://localhost:8080/v1 \
  --api-key "" \
  --model "unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL" \
  --tokenizer "Qwen/Qwen3-32B" \
  --pp 4096 \
  --tg 128 \
  --depth 0 4096 16384 65536 \
  --concurrency 1 2 4 \
  --runs 1 \
  --latency-mode generation

GPU is under-volted with increased power cap via https://github.com/kyuz0/amd-r9700-vllm-toolboxes/blob/main/TUNING.md

+----------------End Config / Notes----------------+

FAQ

How do you make an inference card? You copy mine from this page and edit the fields. If you hate overtype mode, paste my card into your LLM and ask it to fill in your details.
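If you would rather script it than overtype, the fixed-width rows are easy to generate. The following is only a sketch: the helper names `field_row` and `title_row` are invented here, and the inner width of 50 characters is an assumption taken from the example card, not part of any specification.

```python
WIDTH = 50  # inner width between the two border characters (assumed from the example card)


def field_row(label, value, width=WIDTH):
    """Render one '| Label: value   |' row, padded with spaces to the card width."""
    text = f" {label}: {value}"
    if len(text) > width:
        raise ValueError(f"{label!r} row needs {len(text)} columns; make the card wider")
    return "|" + text.ljust(width) + "|"


def title_row(title, width=WIDTH, edge="+"):
    """Render a divider row with the title centered in a run of dashes."""
    return edge + title.center(width, "-") + edge


# Example usage with two fields from the card above.
print(title_row("Inference Card v1"))
print(field_row("Quantization", "UD-Q4_K_XL"))
print(field_row("Engine+ver", "llama.cpp b9733"))
print(title_row("Tok/s", edge="|"))
```

Raising an error on an over-long row, rather than truncating it, keeps the card honest: the fix is a wider card or an extra line, not a clipped value.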

You’re using, e.g., a fork of vLLM? Then specify the repo URL and commit hash instead of the release version.

You ran out of space on the card? Add another line or make the card wider. There are...
