Exact UNORM8 to Float

firephox2 pts0 comments

Exact UNORM8 to float | The ryg blog

Skip to content

Follow:<br>RSS<br>Twitter

The ryg blog

When I grow up I'll be an inventor.

Home

About

Coding

Compression

Computer Architecture

Demoscene

Graphics Pipeline

Maths

Multimedia

Networking

Papers

Stories

Thoughts

Uncategorized

Exact UNORM8 to float

November 6, 2024

GPUs support UNORM formats that represent a number inside [0,1] as an 8-bit unsigned integer. In exact arithmetic, the conversion to a floating-point number is straightforward: take the integer and divide it by 255. 8-bit integers are for sure machine numbers (exactly represented) in float32 and so is 255, so if you’re willing to do a "proper" divide, that’s the end of it; both inputs are exact, so the result of the division is the same as the result of the computation in exact arithmetic rounded to the nearest float32 (as per active rounding mode anyway), which is the best we can hope for.

The D3D11.3 functional spec chickened out here a bit (granted, this verbiage was already in the D3D10 version as I recall) and added the disclaimer that

Requiring exact (1/2 ULP) conversion precision is acknowledged to be too expensive.

For what it’s worth, I had reason to test this a while back (as part of my GPU BCn decoding experiments) and at the time anyway, all GPUs I got my hands on for testing seemed to do exact conversions anyway. It turns out that doing the conversion exactly is not particularly expensive in HW (that might be a post for another day) and certainly doesn’t require anything like a "real" divider, which usually does not exist in GPUs; correctly rounded float32 divisions, when requested, are usually done as a lengthy instruction sequence (plus possibly an even lengthier fallback handler in rare cases).

A conversion that is close enough for essentially all practical purposes is to simply multiply by the float32 reciprocal, i.e. 1.f / 255.f, instead. This is not the same as the exact calculation, but considering you’re quantizing data into an uint8 format to begin with, is almost certainly way more precision than you need.

What if, as a matter of principle, we want an exact solution, though?

Option 1: work in double (float64)

Not much to say here. Converting the 8-bit value x to double, multiplying by 1. / 255., then converting the result to float32 happens to give the same results as the exact calc (i.e. dividing x / 255.f). This is trivial to verify by exhaustive testing. It’s just 256 possible inputs.

Option 2: but I don’t want to use doubles either!

Picky, are we? OK, fine, let’s get to the meat of this post. What if we’d like an exact conversion, would prefer to not use a divide, and would also prefer to avoid using doubles?

This is the fun part. First, let’s turn 1/255 into a geometric series:

and therefore

Going from a single divide to an infinite series might seem like it’s making our problems worse, but it’s progress. The divides by 256k are easy because those are powers of 2, hence exact (and, incidentally, non-overlapping bit shifts). Of course we can’t sum infinitely many terms, but we also don’t need to. All we need is an approximation that gets close enough to round to the right value. How many series terms is that?

The hardest to round cases turn out to be powers of 2, x=2k, k = 0, …, 7. They’re all basically the same, so let’s stick with the easiest example x=1 (k=0). Float32 has 24 effective significand bits (only 23 stored, since a leading 1 is implied). To round correctly, we need 24 correct bits after the leading 1 bit (23 actual bits plus the first truncated bit for rounding)… and then actually another term after that to make sure we don’t accidentally hit an exact halfway case. Our infinite series never terminates, so we need that extra term for non-0 values to make sure the sticky bit gets set pre-normalization to indicate unambiguously that the value ought to round up (and not break the tie towards evens).

Each term contributes 8 bits, and in our x=1 case, normalizing the floating-point number shifts out the first term entirely. In short, we need 1 "sacrificial" term that might get mostly normalized away (in the power-of-2 cases), 3 more terms for the bulk of the mantissa, and then 1 more term after that to avoid the halfway-case problem and ensure our sticky bit gets set correctly come rounding time.

So am I proposing summing 5 terms of a series? That doesn’t sound fast! And, fortunately, no. We don’t need to sum terms one by one; if we’re going to be multiplying anyway, we might as well combine multiple terms into one. Here’s the actual solution I’m thinking of (note this assumes compilation with proper IEEE semantics, not fast math shenanigans):

// Constants here written for posterity but assumed<br>// to be evaluated at compile time.<br>// All values exactly representable as float32.<br>const float k0 = (1.f + 256.f + 65536.f) / 16777216.f;<br>const float k1 = k0 / 16777216.f;

// suppose x is integer in [0,255], then:<br>float ref = x / 255.f;

//...

exact float32 float terms term conversion

Related Articles