Precision Matters in Block Scales
George Constantinides
Research
June 11, 2026
This is the third post in a sequence relating to the geometry of block number formats as angle preservers. In my previous post, I argued that block number formats remain direction preservers even when their block scales are quantised all the way down to powers of two, as is common in some number representations like the MX concrete formats. The main result there was that exponent-only block scaling perturbs direction by at most about , and that is not that much in high dimensions. So we ended last time with the nice result that very coarse block-scale quantisation is relatively benign.
But that doesn’t mean power-of-two scaling is optimal. If we have a fixed budget of bits for representing block scales, how should we spend them? Specifically, should we spend them on exponent range, or on significand precision?
This post argues that the answer becomes much clearer once a high-precision tensor-wide scale is introduced, which is exactly the kind of two-level scaling used in NVIDIA’s NVFP4 format. In NVFP4, 4-bit E2M1 values are combined with an FP8 E4M3 scale for each 16-value micro-block and a second-level FP32 scale for the tensor.
With such a tensor-wide scale, the block scales are relieved of their duty to capture the global magnitude of the tensor. Instead, we can set them a more focused task: reconstruct the relative amplitudes of the blocks, so that the global direction of the represented vector is preserved.
Since drafting this post, Bardia Zadeh and I have also written Direction-Preserving Number Representations, which I blogged about separately. That paper studies the related question of what directions can be obtained when each coordinate of a vector is drawn from a finite scalar alphabet. This post is about block scales rather than scalar elements, but the same product-structured geometry reappears one level higher.
I will argue in this post that once we look at the problem that way, precision in the block scales starts to matter much more. This leads to a rough rule of thumb for the relationship between block scale formats and vector lengths.
What a tensor-wide scale changes
Suppose, as per my previous posts, that each block is represented as $s_i m_i$, where $m_i$ is a chosen mantissa vector and $s_i$ is the ideal real-valued block scale for that mantissa direction.
Now suppose that the final represented tensor has the form
$$\hat{x} = g \bigoplus_i \hat{s}_i m_i,$$
where
$g$ is the high-precision tensor-wide scale,
$\hat{s}_i$ is the low-precision per-block scale, and
$\bigoplus$ denotes direct sum (block concatenation).
Of course, the tensor-wide scale has no effect on direction at all: it multiplies the whole tensor uniformly, so it only changes magnitude. That means the tensor-wide scale can be used to absorb the global length of the vector, leaving the block scales to encode relative block amplitudes.
In other words, once a tensor-wide scale is present, the block scales stop answering the question "how large is this tensor?" and instead answer the question "how do the blocks compare with one another?"
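A minimal NumPy sketch of this division of labour follows. It is illustrative only: the helper names `cosine` and `split_scales`, the random test vectors, and the convention of taking the tensor-wide scale to be the largest block scale are all assumptions made for the example, not something the formats prescribe.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def split_scales(block_scales):
    """Factor positive block scales into a tensor-wide scale and relative scales.

    Assumption for this sketch: the tensor-wide scale is the largest block
    scale, so the relative scales lie in (0, 1].
    """
    block_scales = np.asarray(block_scales, dtype=float)
    g = float(block_scales.max())
    return g, block_scales / g

rng = np.random.default_rng(0)
x = rng.standard_normal(64)                 # hypothetical reference tensor
x_hat = x + 0.1 * rng.standard_normal(64)   # stand-in for a quantised version

# A positive tensor-wide scale changes length but never direction.
for g in (1e-3, 1.0, 250.0):
    assert np.isclose(cosine(x, g * x_hat), cosine(x, x_hat))

# The global magnitude moves into g; only block-to-block ratios remain.
g, rel = split_scales([0.02, 0.5, 8.0, 3.0])
assert np.allclose(g * rel, [0.02, 0.5, 8.0, 3.0])
```

Whatever convention is used to pick the tensor-wide scale, the relative scales carry all of the directional information.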
Exact scale-only cosine factor
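Let $y = \bigoplus_i y_i$, with $y_i = s_i m_i$ for mantissa vector $m_i$, denote the ideal blockwise representation obtained using the real-valued scales $s_i$, and let $\hat{y} = \bigoplus_i \hat{y}_i$ denote the represented tensor after block-scale quantisation.
Write
$$\rho_i = \frac{\hat{s}_i}{s_i},$$
where $\hat{s}_i$ is the quantised scale of block $i$.
Then the represented block is simply
$$\hat{y}_i = \hat{s}_i m_i = \rho_i y_i.$$
So scale quantisation does not change any chosen block direction; it only rescales the ideal projected blocks.
Define
$$w_i = \frac{\|y_i\|^2}{\sum_j \|y_j\|^2},$$
so that $w_i$ is the fraction of the ideal projected energy contained in block $i$.
Then, exactly as in the previous post, we have
$$\cos\theta = \frac{\sum_i w_i \rho_i}{\sqrt{\sum_i w_i \rho_i^2}},$$
where $\theta$ is the angle between $y$ and $\hat{y}$.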
This says that directional distortion from block-scale quantisation depends only on how uneven the multiplicative scale errors are across blocks.
If all blocks were rescaled by the same factor, direction would be unchanged.
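The identity can be checked numerically. The sketch below assumes the cosine factor is the energy-weighted mean of the per-block scale ratios divided by the square root of their energy-weighted mean square. The names `w`, `rho` and `scale_only_cosine`, and the random blocks and error ratios, are illustrative assumptions.

```python
import numpy as np

def scale_only_cosine(w, rho):
    """Cosine between the ideal and rescaled tensors.

    w   : energy fraction of each ideal block (non-negative, sums to 1)
    rho : multiplicative scale error of each block (quantised / ideal)
    """
    w = np.asarray(w, dtype=float)
    rho = np.asarray(rho, dtype=float)
    return float(np.sum(w * rho) / np.sqrt(np.sum(w * rho**2)))

rng = np.random.default_rng(1)
num_blocks, block_len = 8, 16
m = rng.standard_normal((num_blocks, block_len))  # stand-in mantissa vectors
s = rng.uniform(0.5, 4.0, num_blocks)             # hypothetical ideal scales
rho = rng.uniform(0.8, 1.25, num_blocks)          # hypothetical scale errors

ideal_blocks = s[:, None] * m
y = ideal_blocks.ravel()                          # ideal tensor
y_hat = (rho[:, None] * ideal_blocks).ravel()     # after scale quantisation

energy = np.sum(ideal_blocks**2, axis=1)
w = energy / energy.sum()

direct = float(y @ y_hat / (np.linalg.norm(y) * np.linalg.norm(y_hat)))
assert np.isclose(direct, scale_only_cosine(w, rho))

# A common rescaling factor leaves the direction untouched.
assert np.isclose(scale_only_cosine(w, np.full(num_blocks, 3.0)), 1.0)
```

Only the spread of `rho` across blocks matters: multiplying every entry of `rho` by the same positive constant leaves the result unchanged.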
Two jobs for two different kinds of bits
A block-scale format does two things.
First, its exponent bits determine what range of relative block scales can be represented without clipping or underflow.
Second, its significand bits determine how accurately the in-range scales are represented.
So there are really two error sources:
tail loss, from blocks whose scales fall outside the representable range;
in-range uneven rescaling, from finite precision within the range.
The interesting question is how these trade off when the total number of scale bits is fixed.
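To make the trade-off concrete, here is a rough sketch of a toy block-scale format and a sweep over the ways of splitting a fixed bit budget. Everything in it is an assumption for illustration. The toy format has no subnormals and no special values, its top binade is pinned at exponent 0 because the tensor-wide scale is assumed to absorb the largest block scale, the block amplitudes are drawn from a log-normal distribution, and the mantissa vectors are assumed to have equal norm. It does not model E4M3 or any other concrete format, and it measures one random draw, not a worst case.

```python
import numpy as np

def quantise_scale(r, exp_bits, sig_bits, max_exp=0):
    """Round non-negative scales to a toy float: (1 + k / 2**sig_bits) * 2**e.

    e ranges over 2**exp_bits consecutive exponents ending at max_exp.
    Values above the largest code are clamped; values below half the
    smallest code are flushed to zero (the "tail").
    """
    r = np.asarray(r, dtype=float)
    min_exp = max_exp - 2**exp_bits + 1
    _, e = np.frexp(r)                       # r = f * 2**e with f in [0.5, 1)
    e = np.clip(e - 1, min_exp, max_exp)     # binade of r, limited to the range
    step = np.ldexp(1.0, e - sig_bits)       # spacing of codes in that binade
    q = np.round(r / step) * step
    largest = (2.0 - 2.0**-sig_bits) * 2.0**max_exp
    smallest = 2.0**min_exp
    q = np.minimum(q, largest)
    return np.where(r < smallest / 2, 0.0, np.maximum(q, smallest))

def scale_only_cosine(w, rho):
    """Energy-weighted cosine factor for per-block scale ratios rho."""
    return float(np.sum(w * rho) / np.sqrt(np.sum(w * rho**2)))

def sweep_splits(total_bits, rel_scales, weights):
    """Cosine factor for every (exponent bits, significand bits) split."""
    out = {}
    for exp_bits in range(1, total_bits + 1):
        sig_bits = total_bits - exp_bits
        q = quantise_scale(rel_scales, exp_bits, sig_bits)
        out[(exp_bits, sig_bits)] = scale_only_cosine(weights, q / rel_scales)
    return out

rng = np.random.default_rng(2)
s = np.exp(rng.normal(0.0, 1.0, 256))   # hypothetical ideal block scales
rel = s / s.max()                       # tensor-wide scale absorbs the maximum
w = rel**2 / np.sum(rel**2)             # energy fractions for equal-norm mantissas

results = sweep_splits(6, rel, w)
for (exp_bits, sig_bits), c in results.items():
    print(f"E{exp_bits}M{sig_bits}: cosine factor {c:.6f}")
```

With too few exponent bits the flushed tail dominates the loss; with too few significand bits the uneven in-range rounding does. The spread of block amplitudes in the data determines where the balance falls.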
A conservative scale-only view for fixed $B$
Suppose the tensor is divided into $B$ blocks.
If we want a format-level guarantee, a natural scale-only question is:
for a given number of blocks $B$, how should a fixed budget of scale bits be split between exponent and significand so as to control the worst-case angular loss caused by scale quantisation?
I am deliberately saying "scale-only" here because this is not a claim about the globally optimal scalar alphabet (a problem Bardia and I cover in our preprint, linked to above), nor about the full problem of choosing the mantissa vectors. It is a conservative model of...