Integer Quantization: Deep Dive

A lot has happened in transformer quantization over the past few years, from barely being able to quantize a 7B model in INT8 without destroying accuracy, to routinely fitting a 70B model in 4 bits on a single GPU. But existing guides on the topic are fragmented: they focus either on a specific technique or on how to use a library. I’ve been working on integer quantization for fixed-point hardware for a while now, and my goal with this series is to bridge that gap: building the core ideas carefully and tracing how the field has evolved, with each technique motivated by the problems of what came before. This first post covers the foundations: what quantization is, why it’s hard, and the math behind it.

What is Quantization & why should you care?

Quantization is the process of representing high-precision values using fewer bits. In practice, this means storing weights and (optionally) activations in lower precision (e.g., int8 instead of fp16), introducing a small approximation error.

The most immediate and easy-to-realize benefit of quantization is memory reduction. As a rule of thumb, a model with N billion parameters requires roughly 2 × N GB of memory when stored in 16-bit precision. Quantizing to 8-bit or 4-bit reduces this footprint by 2× and 4×, respectively.
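The rule of thumb above can be checked with a few lines of arithmetic. This is a minimal sketch that counts weight storage only; the helper name `weight_memory_gb` is hypothetical, 1 GB is taken as 10^9 bytes, and the extra storage for scales, zero-points, activations, and KV cache is ignored.

```python
def weight_memory_gb(num_params_billion: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Counts the weights only: scales, zero-points, activations and
    KV cache are deliberately left out of this estimate.
    """
    bytes_per_param = bits_per_param / 8
    # billions of parameters * bytes per parameter = gigabytes
    return num_params_billion * bytes_per_param


for bits in (16, 8, 4):
    print(f"70B model at {bits}-bit: ~{weight_memory_gb(70, bits):.0f} GB")
# 70B model at 16-bit: ~140 GB
# 70B model at 8-bit: ~70 GB
# 70B model at 4-bit: ~35 GB
```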

There is also a hardware advantage. In 2014, Mark Horowitz of Stanford University published Computing’s Energy Problem, a paper that compares the energy cost of floating-point and integer operations:

Energy Costs for various operations on a 45nm CMOS node. Source: Computing’s Energy Problem

So, integer arithmetic consumes less energy: an int8 add uses about 30x less energy than an fp32 add, and an int8 multiply uses about 18x less than an fp32 multiply. Lower-precision hardware is also faster and takes up less silicon area than floating-point hardware.
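As a rough check on those ratios, here is the arithmetic behind them. The per-operation energies are the values usually quoted from the paper's 45nm table (int8 add 0.03 pJ, fp32 add 0.9 pJ, int8 multiply 0.2 pJ, fp32 multiply 3.7 pJ); they are not reproduced in the text above, so treat them as an assumption.

$$\frac{E_{\text{fp32 add}}}{E_{\text{int8 add}}} = \frac{0.9\ \text{pJ}}{0.03\ \text{pJ}} = 30, \qquad \frac{E_{\text{fp32 mul}}}{E_{\text{int8 mul}}} = \frac{3.7\ \text{pJ}}{0.2\ \text{pJ}} = 18.5$$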

How do these benefits translate to real-world gains? It depends on the bottleneck:

Compute-bound workloads (e.g., CNNs, LLM prefill): Quantization can improve throughput since lower-precision arithmetic is faster and consumes less energy.

Memory-bandwidth-bound workloads (e.g., LLM decoding): Quantization reduces the amount of data moved, improving performance by lowering memory bandwidth pressure.

By this point, the motivation should be clear: quantization reduces memory, lowers energy consumption, and can improve performance. Next, we will look at the hardware unit that executes fixed-point arithmetic.

Multiply Accumulate Unit

The dominant operation in neural networks is matrix multiplication. Modern hardware accelerators optimize this using specialized units called Multiply–Accumulate (MAC) units:

Matrix–vector computation in neural network accelerator hardware. Source: A White Paper on Neural Network Quantization

The diagram represents a typical matrix–vector multiply unit in neural network accelerators. This is the building block for matrix multiplications and convolutions. The two fundamental components are the processing elements \(C_{n,m}\) and the accumulators \(A_n\).

The computation proceeds as follows:

The accumulators are first initialized with the bias value \(b_n\)

In the next cycle, weights \(W_{n,m}\) and input values \(x_m\) are loaded

Their product is computed at each processing element:

$$C_{n,m} = W_{n,m} \cdot x_m$$

The results are then accumulated:

$$A_n = b_n + \sum_{m} C_{n,m}$$
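The steps above can be sketched in plain Python. This is only a sequential illustration of the dataflow, with a hypothetical function name `mac_matvec`; real accelerators compute the products of many processing elements in parallel, and the integer values are arbitrary examples.

```python
def mac_matvec(W, x, b):
    """Compute A_n = b_n + sum_m W[n][m] * x[m], one accumulator per row."""
    acc = list(b)  # step 1: initialise each accumulator A_n with the bias b_n
    for n, row in enumerate(W):
        for m, w_nm in enumerate(row):
            c_nm = w_nm * x[m]  # processing element: C_{n,m} = W_{n,m} * x_m
            acc[n] += c_nm      # accumulate the product into A_n
    return acc


W = [[1, -2, 3],
     [4, 0, -1]]
x = [2, 5, -3]
b = [10, -4]
print(mac_matvec(W, x, b))  # [-7, 7]
```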

How is quantization done?

Starting from a real-valued vector \(x\), we map it to an integer grid \(\{x_{\text{int}}^{\min}, \ldots, x_{\text{int}}^{\max}\}\):

$$x_{\text{int}} = \mathrm{clamp}\left( \left\lfloor \frac{x}{s} \right\rceil + z,\; x_{\text{int}}^{\min},\; x_{\text{int}}^{\max} \right)$$

Here:

\(s\) is the scale

\(z\) is the zero-point (offset)

\(\lfloor \cdot \rceil\) denotes rounding to the nearest integer

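The clamp operation ensures the result lies within the valid integer range: values below \(x_{\text{int}}^{\min}\) are set to \(x_{\text{int}}^{\min}\), and values above \(x_{\text{int}}^{\max}\) are set to \(x_{\text{int}}^{\max}\).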

So, the idea is to scale and shift the floating-point value, then clamp it to fit within the integer grid.
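A minimal scalar sketch of this mapping, under a few assumptions: the function name `quantize` is hypothetical, ties in the rounding step are broken upward via `floor(v + 0.5)` (the formula above does not specify a tie-breaking rule), and the scale of 0.5 is chosen only to keep the arithmetic easy to follow.

```python
import math


def quantize(x, scale, zero_point, q_min, q_max):
    """Map a real value to the integer grid [q_min, q_max]."""
    # Scale, round to the nearest integer (ties rounded up: an assumption),
    # then shift by the zero-point.
    q = math.floor(x / scale + 0.5) + zero_point
    # Clamp into the valid integer range.
    return max(q_min, min(q_max, q))


# Signed 8-bit grid with zero-point 0
print(quantize(1.3, 0.5, 0, -128, 127))    # 3
print(quantize(-0.6, 0.5, 0, -128, 127))   # -1
print(quantize(100.0, 0.5, 0, -128, 127))  # 127 (clamped)

# Unsigned 8-bit grid with zero-point 128
print(quantize(1.3, 0.5, 128, 0, 255))     # 131
```

Note that the third value is clipped: 100.0 / 0.5 = 200 falls outside the int8 range, so the result saturates at 127.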

Quantization Simulation (Fake Quantization)

Instead of running quantized models directly on target hardware, we often simulate quantization on general-purpose hardware using high-level frameworks like PyTorch. This is commonly referred to as fake quantization.

The key idea is simple: we mimic the effects of quantization while still executing operations in floating point. This allows us to study accuracy and perform experiments like Quantization Aware Training (QAT) without requiring specialized hardware.

To do this, we:

Quantize the input to an integer grid

Dequantize it back to floating point

Perform all computations in floating point on standard hardware (e.g., GPUs)

The dequantization step maps integers back to real values:

$$\widehat{\mathbf{x}} = s \left( \mathbf{x}_{\text{int}} - z \right)$$
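The quantize-then-dequantize round trip can be sketched for a single scalar as follows. This is an illustration in plain Python rather than PyTorch; the names `quantize`, `dequantize`, and `fake_quantize` are hypothetical, and rounding ties are broken upward as an assumption.

```python
import math


def quantize(x, scale, zero_point, q_min, q_max):
    """Scale, round to nearest (ties up: an assumption), shift, and clamp."""
    q = math.floor(x / scale + 0.5) + zero_point
    return max(q_min, min(q_max, q))


def dequantize(q, scale, zero_point):
    """Map an integer back to a real value: x_hat = s * (x_int - z)."""
    return scale * (q - zero_point)


def fake_quantize(x, scale, zero_point, q_min, q_max):
    """Quantize then dequantize; the result is a float on the quantization grid."""
    return dequantize(quantize(x, scale, zero_point, q_min, q_max), scale, zero_point)


print(fake_quantize(1.3, 0.5, 0, -128, 127))    # 1.5  (rounding error of 0.2)
print(fake_quantize(100.0, 0.5, 0, -128, 127))  # 63.5 (clipped to the grid maximum)
```

The two outputs show the two error sources: inside the representable range the error is at most half a scale step, while values outside the range are clipped to the nearest grid limit.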

Combining quantization and dequantization, we get:

$$<br>\widehat{\mathbf{x}} = q(\mathbf{x};...
