GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)
If you have spent any time trying to run a large language model on your own machine, you have hit the same wall everyone does: the model is enormous and your VRAM is not. A 70-billion-parameter model in its native 16-bit precision wants about 140 GB of memory just to hold the weights. Almost nobody has that. Quantization is the trick that closes the gap, and it is also where the jargon avalanche begins. GGUF. GPTQ. AWQ. Q4_K_M. NF4. EXL2.

This guide is the version we wish existed when we started: what quantization actually does, what each of the major formats is really for, the honest trade-offs, and a decision table you can use in thirty seconds. No hand-waving, no assuming you already read the papers.

What quantization actually is

A model's "weights" are just a giant pile of numbers. By default each one is stored at 16-bit precision (FP16 or BF16), which is two bytes per weight. Quantization stores those same numbers using fewer bits: 8, 5, 4, sometimes as low as 2. Fewer bits per weight means a smaller file and less memory, at the cost of some precision.

The memory math is refreshingly simple. Multiply the parameter count by the bytes per weight:

FP16 (16-bit): 2 bytes/weight → a 70B model needs ~140 GB
8-bit: ~1 byte/weight → ~70–75 GB
4-bit: ~0.5 byte/weight → ~35 GB on paper, closer to ~40 GB once the scaling metadata that real 4-bit formats store is included

That single jump from 16-bit to 4-bit is what turns "needs a data-center GPU" into "runs on a 48 GB card, or a unified-memory box." The surprising part, and the reason quantization is everywhere, is that a well-done 4-bit model is shockingly close in quality to the original. The degradation is real but small, and for most use it is invisible. We break the exact numbers down in our companion piece on how much VRAM you actually need for a 70B model.

The three big formats

GGUF: the one most people should use

GGUF is the file format used by llama.cpp (and everything built on it: Ollama, LM Studio, Jan, KoboldCpp). It is the successor to the older GGML format.
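A GGUF file also identifies itself in its first bytes, independent of the filename. The following is a minimal sketch, assuming the header layout described in the GGUF specification (a 4-byte `GGUF` magic followed by a little-endian 32-bit version number); the function name `read_gguf_version` is made up for illustration and is not part of llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"  # assumed: the first four bytes of a GGUF file


def read_gguf_version(path):
    """Return the GGUF version number, or None if the file is not GGUF.

    Assumes the header starts with a 4-byte magic followed by a
    little-endian uint32 version, as in the GGUF specification.
    """
    with open(path, "rb") as f:
        header = f.read(8)
    # Too short, or wrong magic: not a GGUF file.
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

Calling `read_gguf_version("model.gguf")` on a real download should return a small integer; a `None` result suggests the file is something else, such as an older GGML file or a safetensors checkpoint with the wrong extension.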
If you download a model from Hugging Face and the filename ends in .gguf, this is what you have.

GGUF's superpower is flexibility. It runs on CPU, GPU, or a mix of both: you can offload as many layers to your GPU as fit and let the CPU handle the rest. That is why it is the default for Mac users (Apple Silicon via Metal) and for anyone whose model does not quite fit in VRAM. It also ships in a huge range of quantization levels, the "k-quants," which is where the cryptic suffixes come from:

Q8_0: ~8.5 bits per weight (bpw), essentially lossless. Use it when you have the memory and want zero compromise.
Q6_K: ~6.6 bpw, near-indistinguishable from full precision.
Q5_K_M: ~5.7 bpw, a high-quality middle ground.
Q4_K_M: ~4.8 bpw. This is the community default and the sweet spot: about a 1% perplexity hit for roughly a third of the original size.
Q3_K_M: ~3.9 bpw. Noticeably more degraded, but usable when memory is tight.
Q2_K: ~3.4 bpw. The "I just want it to load at all" tier. Quality drops meaningfully; treat it as a last resort.

The letter suffix matters: _S (small) and _M (medium) are size variants of the same bit level, with _S giving up a sliver more quality for a smaller file. The newer I-quants (IQ4_XS, IQ3_M, etc.) squeeze out a bit more quality per byte, typically with the help of importance-matrix calibration, at the cost of slightly slower inference on some hardware.

Use GGUF if: you are on a Mac, you are mixing CPU and GPU, you want the widest model selection, or you simply want the path of least resistance. For most readers, the honest answer is "start here."

GPTQ: the GPU-native 4-bit standard

GPTQ is a post-training quantization method introduced by Frantar et al. in 2022 (arXiv:2210.17323, later presented at ICLR 2023).
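For contrast, the naive baseline is plain round-to-nearest quantization. The block below is a minimal sketch of that baseline, assuming symmetric 4-bit quantization with one scale per row; the function names and the sample row are illustrative, and this is deliberately not GPTQ's algorithm.

```python
def quantize_rtn(weights, bits=4):
    """Symmetric round-to-nearest quantization of one row of weights.

    Returns (codes, scale): integer codes in the signed `bits`-bit range
    and the float scale that maps them back to approximate weights.
    """
    qmax = 2 ** (bits - 1) - 1   # largest code, e.g. 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # smallest code, e.g. -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Each weight is rounded on its own, then clamped to the valid range.
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


row = [0.12, -0.53, 0.98, -0.05, 0.31]
codes, scale = quantize_rtn(row)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(row, restored))
```

Here every weight is rounded independently, so the per-weight error can be as large as half a quantization step and nothing compensates for it. That uncorrected error is the gap GPTQ targets.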
Rather than naively rounding every weight, it uses approximate second-order (Hessian) information to quantize weights one column at a time while compensating for the error introduced. It is a one-shot process that runs in a few GPU-hours even for huge models.

The practical point: GPTQ is weight-only, GPU-only, and fast at inference. It shines when the whole model fits in VRAM and you are serving it through a GPU runtime. It was the dominant 4-bit format on Hugging Face for a long time and is widely supported by serving stacks. Its weakness is that it does not gracefully spill to CPU the way GGUF does; it is an all-in-VRAM format.

Use GPTQ if: your model fits entirely in GPU memory and you want a mature, well-supported 4-bit format for GPU serving.

AWQ: the accuracy-focused challenger

AWQ (Activation-aware Weight Quantization) comes from Lin et al. (arXiv:2306.00978, MLSys 2024 best paper). Its insight is clever: not all weights matter equally. A small fraction (~1%) of "salient" weight channels, identified by looking at the activations flowing through them rather than the weights themselves, carry an outsized share of the model's quality. AWQ protects those channels by scaling them before quantizing, so the important parts survive 4-bit...