Choosing a GGUF Model: K-Quants, IQ Variants, and Legacy Formats


The Kaitchup – AI on a Budget


Reviewing the differences between each type and their impact on accuracy, throughput, and memory.

Benjamin Marie Oct 13, 2025 ∙ Paid

For local LLM inference, the GGUF format, introduced by llama.cpp and popularized by frontends like Ollama, is by far the most common choice. Each major LLM release is quickly followed by a wave of community GGUF conversions on the Hugging Face Hub. Prominent curators include Unsloth, Bartowski, and TheBloke (whose conversions remain widely used), among many others. Repos often provide dozens of variants per model, each tuned for a different memory/quality trade-off.

For instance, Unsloth released 25 GGUF versions of Qwen3-8B and 26 versions of DeepSeek-V3.1-Terminus.

Figure: Unsloth’s 25 GGUF versions of Qwen3-8B.

That’s a lot of choice, but beyond filename and size, there’s rarely a clear guide to accuracy, speed, or trade-offs for each format. New variants land regularly, so I wrote this guide to demystify the main GGUF-serializable formats across architectures: how they work, why their accuracy/size/throughput differ, and when to pick each one. (This guide doesn’t cover converting your own models; I’ve written about that separately.) If you are looking for “How to Run GGUF Models,” check this article.

What Is GGUF?

I introduced GGUF in this article:

“GGUF Quantization for Fast and Memory-Efficient Inference on Your CPU,” Benjamin Marie, February 29, 2024

TL;DR: Most GGUF weight formats are blockwise. A matrix is split into fixed-size blocks, each block is stored as compact integer codes, and a small set of per-block parameters reconstructs approximate floating-point weights at inference. The design space is defined by three choices:

The number of bits used for the parameters

The block size

The dequantization rule (linear scale and zero-point, multi-scale hierarchies, or non-linear/LUT-assisted schemes)

The more expressive the dequantization rule, the lower the error you can achieve for the same number of bits, at some decode cost.
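To make the effect of the dequantization rule concrete, here is a minimal sketch in plain Python. It is not the llama.cpp implementation: the function names are hypothetical, the rounding is simplified, and the scale and offset are kept as full-precision floats. It quantizes one 32-weight block to 4 bits twice, once with a scale only and once with a scale plus an offset, and compares the reconstruction error on a deliberately skewed block.

```python
import random

def quantize_symmetric(block, bits):
    """Scale-only rule: w is approximated by scale * q, with q a signed integer."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4 bits
    amax = max(abs(w) for w in block)
    scale = amax / qmax if amax > 0 else 1.0
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in block]
    return [scale * q for q in codes]

def quantize_asymmetric(block, bits):
    """Scale + offset rule: w is approximated by scale * q + lo, with q unsigned."""
    levels = 2 ** bits - 1                          # 15 for 4 bits
    lo, hi = min(block), max(block)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [max(0, min(levels, round((w - lo) / scale))) for w in block]
    return [scale * q + lo for q in codes]

def rmse(a, b):
    """Root-mean-square error between two equally sized lists."""
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5

random.seed(0)
# A skewed block: every weight is positive and far from zero.
block = [random.uniform(0.5, 1.0) for _ in range(32)]

err_sym = rmse(block, quantize_symmetric(block, 4))
err_asym = rmse(block, quantize_asymmetric(block, 4))
print(f"scale only:     RMSE = {err_sym:.4f}")
print(f"scale + offset: RMSE = {err_asym:.4f}")
```

On a block like this, the scale-only rule spends half of its codes on negative values that never occur, so the scale-plus-offset rule reconstructs the weights with a noticeably smaller error for the same number of bits per code. The price is one extra parameter per block.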

In the next sections, “bits/weight” refers to the effective average once overheads like block scales are included. Values are approximate and vary a little by implementation and tensor shape, but they are useful for thinking about trade-offs.

Legacy Formats: Q_0 and Q_1
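The effective bits/weight of a block format is simple arithmetic: the bits spent on weight codes plus the bits spent on per-block parameters, divided by the number of weights in the block. The sketch below applies this to the legacy formats covered in this section. The layouts are assumptions based on my understanding of llama.cpp (32-weight blocks, 16-bit scale, 16-bit offset for the “_1” variants); `bits_per_weight` and `LAYOUTS` are illustrative names, not part of any library.

```python
def bits_per_weight(block_size, code_bits, overhead_bits):
    """Effective bits/weight = (code bits + per-block metadata bits) / block size."""
    return (block_size * code_bits + overhead_bits) / block_size

# name: (weights per block, bits per code, metadata bits per block), assumed layouts
LAYOUTS = {
    "Q4_0": (32, 4, 16),   # one 16-bit scale
    "Q4_1": (32, 4, 32),   # 16-bit scale + 16-bit offset
    "Q5_0": (32, 5, 16),
    "Q5_1": (32, 5, 32),
    "Q8_0": (32, 8, 16),
}

for name, layout in LAYOUTS.items():
    print(f"{name}: {bits_per_weight(*layout):.2f} bits/weight")
```

Under these assumptions, the per-block metadata adds 0.5 bits/weight for the “_0” variants and 1 bit/weight for the “_1” variants on top of the nominal code width.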

The legacy family of GGUF formats (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0) implements classic per-block linear quantization. A block stores n-bit weight codes and either one scale (the “_0” variants, symmetric) or one scale plus one offset/zero-point (the “_1” variants, asymmetric). Dequantization is a single affine transform per block.

These formats are simple to decode and therefore fast. Their weakness is representational: one affine map per block cannot model skewed or heavy-tailed weight distributions as well as newer schemes. At 8-bit, the difference is negligible, and Q8_0 is near-lossless for most LLMs, which is why many Q8_0 models are still being published on the HF Hub. At 5-bit and especially 4-bit, legacy formats leave measurable accuracy on the table compared with modern alternatives. They remain relevant for maximum simplicity and compatibility, and on some older devices, their very cheap decoding can still be a speed win.

A concise way to think about the legacy set: Q8_0 is a safe INT8 baseline, Q5_0/1 are decent mid-range choices if you must stick to legacy, and Q4_0/1 are largely superseded by K- and I-quants for quality per bit.

K-quants: Modern Default for 3–6 Bits

K-quants (Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, and their mixed variants like _S, _M, _L) introduce structure beyond a single affine transform per block. We saw how to make these models here:

“GGUF Quantization with Imatrix and K-Quantization to Run LLMs on Your CPU,” Benjamin Marie, September 9, 2024

The most common pattern is a two-level scheme: small blocks with their own scale and zero-point grouped into a super-block with an additional scale/offset. In practice, this behaves like a piecewise-affine approximation that captures both local and global variation with little overhead. Most variants are asymmetric (each block stores a scale and a minimum, so the quantization grid does not have to be centered on zero), with the exceptions of Q3_K and Q6_K, which are symmetric (scale only).

K-quants quantize weights in fixed-size groups packed into 256-weight “super-blocks” (32-weight blocks for Q4_K and Q5_K, 16-weight blocks for the other variants) and apply double quantization to the per-group scales: a scale is first computed for each group, then those scales are themselves quantized. This reduces metadata overhead and improves quality per bit compared to legacy formats. The result is lower error at the same storage. For example, a typical Q4_K lands around the mid-4s bits/weight, in the same range as Q4_0 and Q4_1 once you count its extra parameters, but it achieves distinctly better...
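The two-level idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the llama.cpp code: it assumes a Q4_K-like layout (256-weight super-block, eight 32-weight sub-blocks, 4-bit codes, sub-block scales and minimums stored as 6-bit integers, two 16-bit super-block values), and all names are hypothetical.

```python
import random

SUPER_BLOCK = 256   # weights per super-block
SUB_BLOCK = 32      # weights per sub-block
META_BITS = 6       # bits per quantized sub-block scale or minimum (assumed)

def quantize_super_block(weights, bits=4):
    """Two-level asymmetric quantization; returns the reconstructed weights."""
    levels = 2 ** bits - 1
    meta_levels = 2 ** META_BITS - 1
    subs = [weights[i:i + SUB_BLOCK] for i in range(0, len(weights), SUB_BLOCK)]

    # Level 1: a float scale and a non-negative minimum per sub-block,
    # so that w is approximated by scale * q - minimum.
    mins = [max(0.0, -min(s)) for s in subs]
    scales = [(max(s) + m) / levels for s, m in zip(subs, mins)]

    # Level 2: quantize those scales and minimums against two super-block values.
    d = max(scales) / meta_levels or 1.0
    dmin = max(mins) / meta_levels or 1.0
    q_scales = [round(s / d) for s in scales]
    q_mins = [round(m / dmin) for m in mins]

    recon = []
    for sub, qs, qm in zip(subs, q_scales, q_mins):
        scale, offset = d * qs, dmin * qm       # what a decoder would rebuild
        for w in sub:
            q = 0 if scale == 0 else max(0, min(levels, round((w + offset) / scale)))
            recon.append(scale * q - offset)
    return recon

def bits_per_weight(bits=4):
    """Codes + quantized scales/minimums + two 16-bit super-block values."""
    n_sub = SUPER_BLOCK // SUB_BLOCK
    total = SUPER_BLOCK * bits + n_sub * 2 * META_BITS + 2 * 16
    return total / SUPER_BLOCK

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(SUPER_BLOCK)]
recon = quantize_super_block(weights)
rmse = (sum((w - r) ** 2 for w, r in zip(weights, recon)) / len(weights)) ** 0.5
print(f"RMSE: {rmse:.5f}  bits/weight: {bits_per_weight():.4f}")
```

Because each sub-block's scale and minimum cost only 6 bits rather than 16, every 32 weights get their own affine map while the whole super-block still averages 4.5 bits/weight under these assumptions.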
