A system programmer's guide to LLM inference

A system programmer’s guide to LLM inference – Xiangpeng's blog

Acknowledgments

Let’s take a moment to thank my PhD sponsors: InfluxData, Bauplan, SpiralDB, and the taxpayers of the State of Wisconsin and the federal government.

LLMs have become so important that I (and probably you as well) want to understand them better, and the best way to learn is to build one.

In this blog post, I’ll share what I’ve learned about LLM inference, from the perspective of a systems programmer.

I picked the model Qwen3.6-35B-A3B-UD-Q4_K_M.gguf because it runs on most machines but is still complex enough to count as a “modern LLM” 1. It is the only model supported here. By the end, we’ll be able to run prefill at 100 tokens/s and decode at 15 tokens/s, which is not bad for a CPU-only machine.

1 The model was released in April 2026.

This blog covers most of the important parts of a local LLM inference engine:

The LLM architecture

Quantization

Fast matrix multiplication

KV cache

But does not cover:

GPU acceleration: this is a pure CPU inference engine. I’ll probably do a GPU follow-up later.

MTP (speculative decoding), because we are not on the GPU yet.

Anything vendor-specific or closed-source, e.g., CUDA.

How to read the name? Qwen3.6-35B-A3B-UD-Q4_K_M.gguf

Qwen (pronounced /kwɛn/, like “when” with a “kw”) is the model family name; it comes from Alibaba, a Chinese company that is not so well known in the tech world.

3.6 is the version number; previous versions were 3.5 and 3, so the family has been around for a while.

35B is the model size: 35 billion parameters. If each parameter were 8 bits (fp8 or int8), the model would be about 35 GB.

A3B means the model activates 3 billion parameters to generate a token. It also implies the model is an MoE (Mixture of Experts) model, unlike dense models where all parameters are activated for every token.

UD-Q4_K_M means the model is quantized to 4 bits (Q4), and K_M is the quantization scheme (there are many ways to quantize a model). UD stands for Unsloth Dynamic quantization; Unsloth is a company that produces these quantized variants.

gguf is the model file format. Unlike safetensors, gguf is a self-contained format: a single file holds everything you need to run the model. In practice, it’s just metadata plus the model weights.
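The size arithmetic behind the name can be sketched in a few lines. This is only a back-of-the-envelope helper: the function names are made up for illustration, 1 GB is taken as 10^9 bytes, and every parameter is assumed to use the same number of bits (which is not true of the real mixed-precision file).

```rust
/// Approximate weight storage in gigabytes (1 GB = 10^9 bytes),
/// assuming every parameter uses the same number of bits.
/// Hypothetical helper, not part of any library.
fn model_size_gb(params: f64, bits_per_param: f64) -> f64 {
    params * bits_per_param / 8.0 / 1e9
}

/// Fraction of the parameters that are activated per token in an MoE model.
/// Hypothetical helper, not part of any library.
fn active_fraction(active_params: f64, total_params: f64) -> f64 {
    active_params / total_params
}
```

For example, `model_size_gb(35e9, 8.0)` gives about 35 GB, `model_size_gb(35e9, 4.0)` gives about 17.5 GB, and `active_fraction(3e9, 35e9)` is a bit under 9%, which is why an A3B model is much cheaper per token than a dense 35B one.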

What’s inside the file?

The file format is pretty boring (as intended): some metadata describing the model, followed by pairs of tensor info and tensor data, as shown in Figure 1.

Figure 1: The GGUF file layout

The tensor info is stored at the beginning of the file, and points to the actual tensor data through the offset and len fields:

    struct TensorInfo {
        name: String, // e.g., `blk.3.attn_norm_weight`
        shape: Vec<u64>,
        dtype: u32,
        offset: u64,
        len: u64,
    }
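As a minimal sketch of how `offset` and `len` locate a tensor, the function below slices the tensor's bytes out of a file that has already been loaded (or memory-mapped) as a byte slice. The `tensor_bytes` function and its `data_start` parameter are assumptions for illustration: here `offset` is treated as relative to wherever the tensor data region begins, and a reader that uses absolute offsets would simply pass `0`.

```rust
/// Mirrors the `TensorInfo` struct described above.
struct TensorInfo {
    name: String,
    shape: Vec<u64>,
    dtype: u32,
    offset: u64,
    len: u64,
}

/// Returns the raw bytes of one tensor, or `None` if the range
/// overflows or falls outside the file.
/// `data_start` is a hypothetical parameter: the byte position
/// that `info.offset` is measured from.
fn tensor_bytes<'a>(file: &'a [u8], data_start: u64, info: &TensorInfo) -> Option<&'a [u8]> {
    // Checked arithmetic guards against corrupt or hostile metadata.
    let start = data_start.checked_add(info.offset)?;
    let end = start.checked_add(info.len)?;
    let start = usize::try_from(start).ok()?;
    let end = usize::try_from(end).ok()?;
    // `get` returns None instead of panicking on an out-of-bounds range.
    file.get(start..end)
}
```

Returning a borrowed slice rather than a copy means the weights can stay in a memory-mapped file and never be duplicated in RAM.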

Note that dtype is a per-tensor field, which means two tensors in the same file may use different data types (i.e., different quantization schemes). This lets us store important tensors (e.g., the embedding weights) with more bits for higher precision, while quantizing the rest more aggressively. As you might guess, per-tensor quantization is a fine art.

A closer look at the quantization schemes shows that the model uses 6 different tensor data types, and more than half of the bytes are quantized to Q4_K:

    BF16      2 tensors      1.00 MB
    F32     368 tensors     99.78 MB
    Q4_K     82 tensors  11808.00 MB
    Q5_K     38 tensors   6688.00 MB
    Q6_K      4 tensors   1027.85 MB
    Q8_0    259 tensors   1978.38 MB
    total   753 tensors  21603.01 MB

The metadata also encodes the model architecture: the number of layers, the model family, and so on.
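The per-type breakdown above is easy to double-check. The sketch below copies the rows from that listing into a constant and computes the total and each type's share of the bytes; `ROWS`, `total_mb`, `total_tensors`, and `share_of` are names invented for this example.

```rust
/// (dtype, tensor count, size in MB), copied from the listing above.
const ROWS: [(&str, u32, f64); 6] = [
    ("BF16", 2, 1.00),
    ("F32", 368, 99.78),
    ("Q4_K", 82, 11808.00),
    ("Q5_K", 38, 6688.00),
    ("Q6_K", 4, 1027.85),
    ("Q8_0", 259, 1978.38),
];

/// Sum of the size column.
fn total_mb() -> f64 {
    ROWS.iter().map(|row| row.2).sum()
}

/// Sum of the tensor-count column.
fn total_tensors() -> u32 {
    ROWS.iter().map(|row| row.1).sum()
}

/// Fraction of all bytes stored in the given dtype, if it appears in the table.
fn share_of(dtype: &str) -> Option<f64> {
    let total = total_mb();
    ROWS.iter().find(|row| row.0 == dtype).map(|row| row.2 / total)
}
```

With these rows, `share_of("Q4_K")` comes out at roughly 0.55, which matches the claim that more than half of the bytes are Q4_K.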

The model architecture

The model belongs to the Qwen3-Next family; for background on the architecture lineage, see Sebastian Raschka’s “Big LLM Architecture Comparison”.

Figure 2 is a simplified view of the architecture and its weight distribution. For every input token, the model first converts it into an embedding vector (with dimension 2048 for this model), then runs it through several layers of computation to produce the final output.

Qwen3.6 is not a typical transformer: it mixes so-called DeltaNet layers with conventional Attention layers at a 3:1 ratio. We will get to the details later; the high-level intuition is that attention’s state (the KV cache) grows linearly with sequence length and its compute grows quadratically, while DeltaNet’s state has a fixed size. This helps the model scale to longer contexts.
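To make the "linear versus fixed" intuition concrete, here is a rough memory model. It is a sketch only: the function names are invented, and the head counts, head dimension, and per-layer state size you plug in are hypothetical inputs, not values read from this model's metadata.

```rust
/// Bytes of KV cache after `seq_len` tokens: one key vector and one value
/// vector per token, per KV head, per attention layer.
/// All arguments are hypothetical inputs chosen by the caller.
fn kv_cache_bytes(
    attn_layers: u64,
    kv_heads: u64,
    head_dim: u64,
    bytes_per_elem: u64,
    seq_len: u64,
) -> u64 {
    2 * attn_layers * kv_heads * head_dim * bytes_per_elem * seq_len
}

/// Bytes of state for layers that keep a fixed-size recurrent state.
/// `seq_len` is accepted only to show that the result does not depend on it.
fn fixed_state_bytes(
    layers: u64,
    state_elems_per_layer: u64,
    bytes_per_elem: u64,
    _seq_len: u64,
) -> u64 {
    layers * state_elems_per_layer * bytes_per_elem
}
```

Doubling `seq_len` doubles the result of `kv_cache_bytes` but leaves `fixed_state_bytes` unchanged, so replacing three out of every four attention layers with fixed-state layers cuts the part of memory that grows with context to about a quarter.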

All Qwen3.6 family models share this architecture; larger members of the family scale by adding more layers and using a wider hidden dimension (e.g., the Qwen3-Next-80B variant has 48 layers and a 2048 hidden dim).2

2 Qwen3.6-35B-A3B has 40 layers organized as 10 repetitions of (3 DeltaNet + 1 Attention) blocks, so 30 DeltaNet layers and 10 Attention layers. See the apxml spec page.
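The layer pattern in the footnote can be written down as a tiny function. This is a sketch that assumes the attention layer is the last of each group of four, as the "(3 DeltaNet + 1 Attention)" wording suggests; `LayerKind`, `layer_kind`, and `layer_plan` are names made up for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum LayerKind {
    DeltaNet,
    Attention,
}

/// Kind of the layer at `index` (0-based), assuming each block of four
/// layers is ordered as 3 DeltaNet layers followed by 1 Attention layer.
fn layer_kind(index: usize) -> LayerKind {
    if index % 4 == 3 {
        LayerKind::Attention
    } else {
        LayerKind::DeltaNet
    }
}

/// Full layer schedule for a model with `num_layers` layers.
fn layer_plan(num_layers: usize) -> Vec<LayerKind> {
    (0..num_layers).map(layer_kind).collect()
}
```

For 40 layers, `layer_plan(40)` contains 30 `DeltaNet` entries and 10 `Attention` entries, matching the counts given in the footnote.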

Figure 2: The Qwen3.6-35B-A3B model architecture and its weight distribution

The table below shows the weight distribution in more detail:

Group Params Stored Share Token...
