FMA Instruction Set

From Wikipedia, the free encyclopedia

Extension to the x86 instruction set

Wikibooks has a book on the topic of: X86 Assembly/AVX, AVX2, FMA3, FMA4

The FMA instruction set is an extension to the 128- and 256-bit Streaming SIMD Extensions instructions in the x86 microprocessor instruction set to perform fused multiply–add (FMA) operations.[1] There are two variants:

FMA4 is supported in AMD processors starting with the Bulldozer architecture. FMA4 was implemented in hardware before FMA3. FMA4 support was removed starting with Zen 1.[2]

FMA3 is supported in AMD processors starting with the Piledriver architecture, and in Intel processors starting with Haswell (2013) and Broadwell (2014).

Instructions

FMA3 and FMA4 instructions have almost identical functionality, but are not compatible. Both contain fused multiply–add (FMA) instructions for floating-point scalar and SIMD operations, but FMA3 instructions have three operands, while FMA4 ones have four. The FMA operation has the form d = round(a · b + c), where the product and sum are computed without intermediate rounding and the round function rounds the final result once, so that it fits within the destination register if it has too many significant bits.
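The effect of rounding only once can be illustrated with a small sketch. The helper `fma_reference` is a hypothetical name, not part of any library: it models d = round(a · b + c) by computing the product and sum exactly with Python's `fractions.Fraction` and rounding once when converting back to a double. It assumes finite inputs and round-to-nearest, and it does not model the sign of a zero result.

```python
from fractions import Fraction


def fma_reference(a: float, b: float, c: float) -> float:
    """Model d = round(a*b + c): exact arithmetic, then a single rounding.

    Hypothetical helper for illustration; finite inputs only.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)  # no rounding happens here
    return float(exact)  # rounded once, to the nearest double


def mul_then_add(a: float, b: float, c: float) -> float:
    """Unfused version: the product is rounded before the addition."""
    return a * b + c


a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# The exact product is 1 - 2**-60, which does not fit in a double.
fused = fma_reference(a, b, c)   # keeps the low bits: -2**-60
unfused = mul_then_add(a, b, c)  # product rounds to 1.0 first: 0.0
```

With these inputs the unfused computation loses the result entirely, while the single-rounding model preserves it.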

The four-operand form (FMA4) allows a, b, c and d to be four different registers, while the three-operand form (FMA3) requires that d be the same register as a, b or c. The three-operand form makes the code shorter and the hardware implementation slightly simpler, while the four-operand form provides more programming flexibility.
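The difference between the two operand forms can be sketched with a toy register file. The dictionary `regs` and the helpers `fma4` and `fma3_231` are hypothetical names for illustration. Plain Python arithmetic stands in for the fused operation, so only the register usage is modelled, not the single rounding.

```python
def fma4(regs: dict, d: str, a: str, b: str, c: str) -> None:
    """Four-operand form: d = a*b + c; all three sources are left intact."""
    regs[d] = regs[a] * regs[b] + regs[c]


def fma3_231(regs: dict, a: str, b: str, c: str) -> None:
    """Three-operand form (231 order): a = b*c + a; the summand is overwritten."""
    regs[a] = regs[b] * regs[c] + regs[a]


regs = {"xmm0": 0.0, "xmm1": 2.0, "xmm2": 3.0, "xmm3": 4.0}

# FMA4: the result goes to xmm0; xmm1, xmm2 and xmm3 keep their values.
fma4(regs, "xmm0", "xmm1", "xmm2", "xmm3")  # xmm0 = 2*3 + 4

# FMA3: to keep the old summand, it has to be copied to another register first.
regs["xmm4"] = regs["xmm3"]                 # extra register move
fma3_231(regs, "xmm3", "xmm1", "xmm2")      # xmm3 = 2*3 + 4
```

The extra copy in the second case is the flexibility cost of the three-operand form; when the old value of the destination is no longer needed, no copy is required.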

See XOP instruction set for more discussion of compatibility issues between Intel and AMD.

FMA3 instruction set

CPUs with FMA3

AMD
- Piledriver (2012) and newer microarchitectures[3]
  - 2nd gen APUs, "Trinity" (32 nm), May 15, 2012
  - 2nd gen "Bulldozer" (bdver2) with Piledriver cores, October 23, 2012

Intel
- Haswell (2013) and newer processors, except Pentiums and Celerons[4][5]

Excerpt from FMA3

Supported commands include

| Mnemonic | Operation |
|---|---|
| VFMADD | result = + a · b + c |
| VFNMADD | result = − a · b + c |
| VFMSUB | result = + a · b − c |
| VFNMSUB | result = − a · b − c |
| VFMADDSUB | result = a · b + c for i = 1, 3, ...; result = a · b − c for i = 0, 2, ... |
| VFMSUBADD | result = a · b − c for i = 1, 3, ...; result = a · b + c for i = 0, 2, ... |

Note

VFNMADD is result = − a · b + c, not result = − (a · b + c).

VFNMSUB generates a −0 when all inputs are zero.
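The sign conventions above can be checked with ordinary floating-point arithmetic. This is only a sketch: the function names mirror the mnemonics but are hypothetical Python helpers, each step is rounded separately (so they are not truly fused), and packed operands are modelled as lists with index 0 as the lowest element.

```python
import math


def vfmadd(a, b, c):
    return a * b + c


def vfnmadd(a, b, c):
    return -(a * b) + c   # only the product is negated, not the sum


def vfmsub(a, b, c):
    return a * b - c


def vfnmsub(a, b, c):
    return -(a * b) - c


def vfmaddsub(a, b, c):
    """Packed form: odd elements add c, even elements subtract c."""
    return [vfmadd(x, y, z) if i % 2 else vfmsub(x, y, z)
            for i, (x, y, z) in enumerate(zip(a, b, c))]


def vfmsubadd(a, b, c):
    """Packed form: odd elements subtract c, even elements add c."""
    return [vfmsub(x, y, z) if i % 2 else vfmadd(x, y, z)
            for i, (x, y, z) in enumerate(zip(a, b, c))]


# -a*b + c differs from -(a*b + c) for a=2, b=3, c=1.
negated_product = vfnmadd(2.0, 3.0, 1.0)   # -(2*3) + 1
negated_sum = -vfmadd(2.0, 3.0, 1.0)       # -(2*3 + 1)

# With all-zero inputs, VFNMSUB produces negative zero.
zero = vfnmsub(0.0, 0.0, 0.0)
is_negative_zero = zero == 0.0 and math.copysign(1.0, zero) == -1.0

addsub = vfmaddsub([1.0] * 4, [2.0] * 4, [5.0] * 4)
subadd = vfmsubadd([1.0] * 4, [2.0] * 4, [5.0] * 4)
```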

Explicit order of operands is included in the mnemonic using numbers "132", "213", and "231":

| Postfix | Operation | Possible memory operand | Overwrites |
|---|---|---|---|
| 132 | a = a · c + b | c (factor) | a (other factor) |
| 213 | a = b · a + c | c (summand) | a (factor) |
| 231 | a = b · c + a | c (factor) | a (summand) |

as well as operand format (packed or scalar) and size (single or double).

| Postfix | Precision | Size | Postfix | Precision | Size |
|---|---|---|---|---|---|
| SS | Single | 1× 32 bit | SD | Double | 1× 64 bit |
| PSx | Single | 4× 32 bit | PDx | Double | 2× 64 bit |
| PSy | Single | 8× 32 bit | PDy | Double | 4× 64 bit |
| PSz | Single | 16× 32 bit | PDz | Double | 8× 64 bit |

This results in

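| Encoding | Mnemonic | Operands | Operation |
|---|---|---|---|
| VEX.256.66.0F38.W1 98 /r | VFMADD132PDy | ymm, ymm, ymm/m256 | a = a · c + b |
| VEX.256.66.0F38.W0 98 /r | VFMADD132PSy | ymm, ymm, ymm/m256 | a = a · c + b |
| VEX.128.66.0F38.W1 98 /r | VFMADD132PDx | xmm, xmm, xmm/m128 | a = a · c + b |
| VEX.128.66.0F38.W0 98 /r | VFMADD132PSx | xmm, xmm, xmm/m128 | a = a · c + b |
| VEX.LIG.66.0F38.W1 99 /r | VFMADD132SD | xmm, xmm, xmm/m64 | a = a · c + b |
| VEX.LIG.66.0F38.W0 99 /r | VFMADD132SS | xmm, xmm, xmm/m32 | a = a · c + b |
| VEX.256.66.0F38.W1 A8 /r | VFMADD213PDy | ymm, ymm, ymm/m256 | a = b · a + c |
| VEX.256.66.0F38.W0 A8 /r | VFMADD213PSy | ymm, ymm, ymm/m256 | a = b · a + c |
| VEX.128.66.0F38.W1 A8 /r | VFMADD213PDx | xmm, xmm, xmm/m128 | a = b · a + c |
| VEX.128.66.0F38.W0 A8 /r | VFMADD213PSx | xmm, xmm, xmm/m128 | a = b · a + c |
| VEX.LIG.66.0F38.W1 A9 /r | VFMADD213SD | xmm, xmm, xmm/m64 | a = b · a + c |
| VEX.LIG.66.0F38.W0 A9 /r | VFMADD213SS | xmm, xmm, xmm/m32 | a = b · a + c |
| VEX.256.66.0F38.W1 B8 /r | VFMADD231PDy | ymm, ymm, ymm/m256 | a = b · c + a |
| VEX.256.66.0F38.W0 B8 /r | VFMADD231PSy | ymm, ymm, ymm/m256 | a = b · c + a |
| VEX.128.66.0F38.W1 B8 /r | VFMADD231PDx | xmm, xmm, xmm/m128 | a = b · c + a |
| VEX.128.66.0F38.W0 B8 /r | VFMADD231PSx | xmm, xmm, xmm/m128 | a = b · c + a |
| VEX.LIG.66.0F38.W1 B9 /r | VFMADD231SD | xmm, xmm, xmm/m64 | a = b · c + a |
| VEX.LIG.66.0F38.W0 B9 /r | VFMADD231SS | xmm, xmm, xmm/m32 | a = b · c + a |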
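The naming scheme can be summarised in a short sketch that assembles a mnemonic from its parts and evaluates the operand order. The names `ORDERS`, `ELEMENT_BITS`, `lane_count`, `mnemonic` and `fma3` are hypothetical. Operands are modelled as Python lists of elements, and the arithmetic is not fused.

```python
# Operand-order postfix -> 1-based operand numbers of (factor, factor, summand).
ORDERS = {"132": (1, 3, 2), "213": (2, 1, 3), "231": (2, 3, 1)}

# Format postfix -> element width in bits.
ELEMENT_BITS = {"SS": 32, "SD": 64, "PS": 32, "PD": 64}


def lane_count(fmt: str, vector_bits: int) -> int:
    """Elements processed: 1 for scalar forms, vector width / element width for packed."""
    if fmt in ("SS", "SD"):
        return 1
    return vector_bits // ELEMENT_BITS[fmt]


def mnemonic(op: str, order: str, fmt: str) -> str:
    """Join operation, operand order and format, e.g. VFMADD + 231 + PD."""
    return f"{op}{order}{fmt}"


def fma3(order: str, op1: list, op2: list, op3: list) -> list:
    """Evaluate VFMADD<order>; the result is the new value of operand 1."""
    operands = (op1, op2, op3)
    f1, f2, s = (operands[i - 1] for i in ORDERS[order])
    return [x * y + z for x, y, z in zip(f1, f2, s)]


a, b, c = [2.0], [3.0], [5.0]
r132 = fma3("132", a, b, c)  # a*c + b
r213 = fma3("213", a, b, c)  # b*a + c
r231 = fma3("231", a, b, c)  # b*c + a
```

For example, `mnemonic("VFMADD", "231", "PD")` gives `VFMADD231PD`, and `lane_count("PD", 256)` gives the four 64-bit elements of a ymm register.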

FMA4 instruction set

CPUs with FMA4

AMD
- "Heavy Equipment" processors
  - Bulldozer-based processors, October 12, 2011[6]
  - Piledriver-based processors[7]
  - Steamroller-based processors
  - Excavator-based processors (including "v2")
- Zen: WikiChip's testing shows that FMA4 still appears to work (under the conditions of the tests) despite not being officially supported and not even being reported by CPUID. This has also been confirmed by Agner Fog.[8] However, other tests gave wrong results.[9] AMD's official website notes FMA4 support for the following Zen CPUs: AMD ThreadRipper 1900x, R7 Pro 1800, 1700, R5 Pro 1600, 1500, R3 Pro 1300, 1200, R3 2200G, R5 2400G.[10][11][12]

Intel
- Intel has not released CPUs...
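Because support differs between processor generations, software typically checks for each extension at run time. The sketch below assumes a Linux system, where `/proc/cpuinfo` lists feature flags derived from CPUID and uses the flag names `fma` (FMA3) and `fma4`. The function names are hypothetical. It only reflects what the CPU reports, so it would not detect unreported FMA4 behaviour such as that described for Zen.

```python
def parse_cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature flags from the first 'flags' line of cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def fma_support(cpuinfo_text: str) -> dict:
    """Report which FMA variants the CPU advertises."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {"FMA3": "fma" in flags, "FMA4": "fma4" in flags}


def read_cpuinfo(path: str = "/proc/cpuinfo") -> str:
    """Read the real file on Linux; not called here so the sketch runs anywhere."""
    with open(path, encoding="utf-8") as f:
        return f.read()


# Made-up sample text in the /proc/cpuinfo layout, for illustration only.
sample = "processor\t: 0\nflags\t\t: fpu sse2 avx avx2 fma\n"
support = fma_support(sample)
```

On a Linux machine, `fma_support(read_cpuinfo())` would apply the same check to the actual processor.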
