Brain Floating-Point

bfloat16 floating-point format - Wikipedia

Jump to content

Donate

Create account

Personal tools

Donate

Create account

bfloat16 floating-point format

8 languages

Català Čeština Deutsch Français 日本語 한국어 Português 中文

Edit links

From Wikipedia, the free encyclopedia

Not to be confused with binary16, a different 16-bit floating-point format.

Floating-point number format used in computer processors

Floating-point formats IEEE 754 16-bit: Half (binary16)

32-bit: Single (binary32), decimal32

64-bit: Double (binary64), decimal64

128-bit: Quadruple (binary128), decimal128

256-bit: Octuple (binary256)

Extended precision

Other Minifloat

bfloat16

TensorFloat-32

Microsoft Binary Format

IBM hexadecimal floating-point

PMBus Linear-11

G.711 8-bit floats

Alternatives Arbitrary precision

Block floating point

Tapered floating point Posit

The bfloat16 (brain floating point )[1][2] floating-point format is a computer number format occupying 16 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. This format is a shortened (16-bit) version of the 32-bit IEEE 754 single-precision floating-point format (binary32) with the intent of accelerating machine learning and near-sensor computing.[3] It preserves the approximate dynamic range of 32-bit floating-point numbers by retaining 8 exponent bits, but supports only an 8-bit precision rather than the 24-bit significand of the binary32 format. More so than single-precision 32-bit floating-point numbers, bfloat16 numbers are unsuitable for integer calculations, but this is not their intended use. Bfloat16 is used to reduce the storage requirements and increase the calculation speed of machine learning algorithms.[4]

The bfloat16 format was developed by Google Brain, an artificial intelligence research group at Google. It is utilized in many CPUs, GPUs, and AI processors, such as Intel Xeon processors (AVX-512 BF16 extensions), Intel Data Center GPU, Intel Nervana NNP-L1000, Intel FPGAs,[5][6][7] AMD Zen, AMD Instinct, NVIDIA GPUs, Google Cloud TPUs,[8][9][10] AWS Inferentia, AWS Trainium, ARMv8.6-A,[11] and Apple's M2[12] and therefore A15 chips and later. Many libraries support bfloat16, such as CUDA,[13] Intel oneAPI Math Kernel Library, AMD ROCm,[14] AMD Optimizing CPU Libraries, PyTorch, and TensorFlow.[10][15] On these platforms, bfloat16 may also be used in mixed-precision arithmetic, where bfloat16 numbers may be operated on and expanded to wider data types.

bfloat16 floating-point format [edit]

bfloat16 has the following format:

Sign bit: 1 bit

Exponent width: 8 bits

Significand precision: 8 bits (7 explicitly stored, with an implicit leading bit), as opposed to 24 bits in a classical single-precision floating-point format

The bfloat16 format, being a shortened IEEE 754 single-precision 32-bit float, allows for fast conversion to and from an IEEE 754 single-precision 32-bit float; in conversion to the bfloat16 format, the exponent bits are preserved while the significand field can be reduced by truncation (thus corresponding to round toward 0) or other rounding mechanisms, ignoring the NaN special case. Preserving the exponent bits maintains the 32-bit float's range of ≈ 10−38 to ≈ 3 × 1038.[16]

The bits are laid out as follows:

IEEE half-precision 16-bit float

sign

exponent (5 bit)

fraction (10 bit)

15 14

bfloat16

sign

exponent (8 bit)

fraction (7 bit)

15 14

Nvidia's TensorFloat-32 (19 bits)

sign

exponent (8 bit)

fraction (10 bit)

18 17

ATI's fp24 format [17]

sign

exponent (7 bit)

fraction (16 bit)

23 22

16 15

Pixar's PXR24 format

sign

exponent (8 bit)

fraction (15 bit)

23 22

15 14

IEEE 754 single-precision 32-bit float

sign

exponent (8 bit)

fraction (23 bit)

31 30

23 22

Exponent encoding [edit]

The bfloat16 binary floating-point exponent is encoded using an offset-binary representation, with the zero offset being 127; also known as exponent bias in the IEEE 754 standard.

Emin = 01H−7FH = −126

Emax = FEH−7FH = 127

Exponent bias = 7FH = 127

Thus, in order to get the true exponent as defined by the offset-binary representation, the offset of 127 has to be subtracted from the value of the exponent field.

The minimum and maximum values of the exponent field (00H and FFH) are interpreted specially, like in the IEEE 754 standard formats.

Exponent

Significand zero

Significand non-zero

Equation

00H

zero, −0

subnormal numbers

(−1)signbit×2−126× 0.significandbits

01H, ..., FEH

normalized value

(−1)signbit×2exponentbits−127× 1.significandbits

FFH

±infinity

NaN (quiet, signaling)

The minimum positive normal value is 2−126 ≈ 1.18 × 10−38 and the minimum positive (subnormal) value is 2−126−7 = 2−133 ≈ 9.2 × 10−41.

Rounding and conversion [edit]

The most common use case is the conversion between IEEE 754 binary32 and bfloat16. The following section describes the conversion process and its rounding...

Brain Floating-Point

Related Articles

The Newest Instagram "Exploit" Is the Goofiest I've Seen

Apple WWDC 2026 Livestream

Claude Fable 5

It's Not Just X. It's Y

Show HN: GoPeek – open links in live mini browser windows without new tabs