Bringing up DeepSeek-V4-Flash on AMD MI300X
1 Jun 2026 · 9 min read
At Doubleword we are building an inference cloud designed for volume. To do that we have to reckon with the enveloping compute shortage.
AMD’s MI300X launched on 6 December 2023, at AMD’s “Advancing AI” event, as AMD’s response to NVIDIA’s H100, arriving alongside H200 in the same generation. It is an odd duck in the world of high-end AI accelerators. While H100 prices are climbing (up 40% in five months on one-year rentals, with on-demand capacity sold out across every major NVIDIA part; see SemiAnalysis, The Great GPU Shortage: Rental Capacity, April 2026), MI300X is perhaps still underappreciated. It offers 192GB of HBM3 per card against the H100’s 80GB, comparable FP8 compute, and a list price roughly half. Yet you can rent one on-demand today (from Hotaisle, for instance) for noticeably less than the equivalent NVIDIA capacity.
The reason is software. The problems with running AI workloads on AMD have been written about exhaustively elsewhere, and there are signs the gap is closing on AMD’s newer chips (SemiAnalysis’s InferenceX dashboard tracks the latest AMD parts, MI350X and MI355X, against current NVIDIA generations). That new focus on software hasn’t extended back to old parts. As of early May 2026, running vLLM with DeepSeek-V4-Flash on MI300X just doesn’t work.
On paper MI300X is an excellent accelerator. We want it to work. This post is a worklog of all the sharp edges and winding paths we found when we tried to get it working.
FP8 dialect
The MI300X was part of the accelerator generation that kicked off the march toward lower bitwidths. LLM weights, and to a lesser extent activations and KV caches, are less sensitive to numerical imprecision than typical HPC workloads, so the Hopper generation of NVIDIA chips and the MI300 generation of Instinct chips added hardware support for sub-16-bit precision for the first time. The result is twice as many FLOPs applied to workloads that correspondingly transfer half as much data.
The problem is that there was disagreement on the best way to build an FP8 datatype. Graphcore and AMD proposed one standard in a 2022 preprint, backed by Qualcomm. Arm, Intel, and NVIDIA proposed another through the Open Compute Project. In a rehash of some of the forks in the road that led to IEEE 754, different providers built in different and incompatible behaviours. (An interview with William Kahan is a great read on how an arithmetic standard actually gets made, including which arguments win and which are forgotten.)
Perhaps unsurprisingly given the list of backers on each side, the AMD / Graphcore standard didn’t make it. AMD’s newer MI325, MI350, and MI355X chips all moved over to OCP-standard FP8. But MI300X still only works in the fnuz dialect, so the initial vLLM work that went into bringing up DeepSeek on AMD didn’t actually work for bringing DeepSeek up on MI300X. (fnuz means “finite, nans, unsigned zero”, i.e. no -0 and no inf. These seem like sensible things to cut out for AI workloads at small floating-point range, where every bit matters, but the dialect never quite took off, and later AMD generations went back to the more normal-looking FP8.)
Lots of vLLM’s FP8 paths are aware of e4m3 versus e5m2 but not of fnuz versus OCP. The two share their bit layout but differ in exponent bias by one, so the same byte read as the wrong dialect comes back off by exactly a factor of two. MI300X is the only major accelerator where that distinction matters in practice.

Throughout, we’ll note the relevant commits from the demo PRs in a public vLLM repo we put up for this post. 236de4e64 makes the DeepSeek v4 compressor and fused compress / quant / cache writes use the platform FP8 dtype so scales and cache bytes agree, and bd06e5d87 routes the sliding-window K-cache through a fnuz-aware fused quantise-and-insert helper.
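The factor-of-two mismatch can be checked by hand with a small decoder. What follows is a minimal pure-Python sketch, not code from vLLM or from the commits above. The function name `decode_e4m3` is ours, and it assumes the commonly documented encodings: OCP e4m3 with bias 7 and NaN at S.1111.111, and e4m3 fnuz with bias 8 and a single NaN at 0x80.

```python
import math


def decode_e4m3(byte: int, *, fnuz: bool) -> float:
    """Decode one E4M3 byte (1 sign, 4 exponent, 3 mantissa bits).

    fnuz=False: OCP e4m3, bias 7, S.1111.111 is NaN, has -0, no inf.
    fnuz=True:  e4m3 fnuz, bias 8, only 0x80 is NaN, no -0, no inf.
    (Illustrative helper; the name is hypothetical, not a vLLM API.)
    """
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if fnuz:
        if byte == 0x80:  # the would-be -0 encoding is the only NaN
            return math.nan
        bias = 8
    else:
        if exp == 0xF and man == 0x7:  # all-ones exponent and mantissa
            return math.nan
        bias = 7
    if exp == 0:  # subnormal: no implicit leading one
        return sign * (man / 8.0) * 2.0 ** (1 - bias)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - bias)


if __name__ == "__main__":
    # Same bytes, two dialects.
    for byte in (0x38, 0x7E, 0x7F, 0x80):
        ocp = decode_e4m3(byte, fnuz=False)
        amd = decode_e4m3(byte, fnuz=True)
        print(f"0x{byte:02X}: ocp={ocp} fnuz={amd}")
```

Under these assumptions the byte 0x38 decodes to 1.0 as OCP and 0.5 as fnuz. Every byte that is finite in both dialects obeys the same 2x relationship. The only exceptions are the NaN encodings: 0x80 is -0 in OCP but NaN in fnuz, while 0x7F and 0xFF are NaN in OCP but ±240 in fnuz. A scale computed for one dialect and applied to bytes written in the other is therefore silently wrong rather than loudly broken.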
Missing attention fast paths
DeepSeek v4’s attention is sparse. Each query attends to a top-k subset of the KV cache picked by a learned indexer, with sliding-window context handled separately.
It’s got a lot of moving pieces: KV compression, the indexer, the sliding-window path, FP8 caches feeding each. For maximum performance in a production deployment, each piece needs special attention (no pun intended) in the form of a tuned kernel.
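To make the selection step concrete, here is a toy sketch of which key positions a single query ends up reading. It is a deliberate simplification and not DeepSeek’s implementation: the indexer scores are plain numbers passed in by the caller, KV compression and FP8 caches are ignored, and `select_keys`, `top_k`, and `window` are names we made up for illustration.

```python
from typing import List, Sequence, Tuple


def select_keys(
    index_scores: Sequence[float],
    query_pos: int,
    top_k: int,
    window: int,
) -> Tuple[List[int], List[int]]:
    """Return (sparse_positions, window_positions) for one query.

    index_scores[j] stands in for the learned indexer's score of key j.
    Only causal positions (j <= query_pos) are eligible.
    """
    causal = range(query_pos + 1)
    # Sparse path: keep the top_k causal positions by indexer score.
    ranked = sorted(causal, key=lambda j: index_scores[j], reverse=True)
    sparse = sorted(ranked[:top_k])
    # Sliding-window path: the most recent `window` positions, kept
    # as a separate list because it is handled by a separate path.
    local = list(range(max(0, query_pos - window + 1), query_pos + 1))
    return sparse, local


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.5]
    print(select_keys(scores, query_pos=5, top_k=2, window=2))
```

Even in this toy form the access pattern is a gather over scattered positions plus a contiguous tail, which is part of why each path wants its own kernel instead of one dense attention call.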
The source of fast tuned kernels on AMD is AITER. AITER is AMD’s tuned-kernel library, roughly the analog of what NVIDIA users get from cuBLAS, cuDNN, FlashAttention, and Transformer Engine combined. vLLM falls back to generic Triton when AITER doesn’t have a path for a given shape, and generic Triton attention is several times slower than a tuned kernel. AITER’s coverage for DSV4 is uneven, and what coverage exists tends to target later AMD parts (CDNA4) rather than the CDNA3 (gfx942) cores in MI300X.
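The fallback behaviour is easiest to reason about as a lookup with a default. The sketch below is purely illustrative and is not vLLM’s or AITER’s dispatch code. The table contents are made up, and the op names, the `gfx950` label for the later parts, and the functions `plan_kernels` and `fallbacks` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Made-up coverage table for illustration only: (op, arch) pairs that
# have a tuned kernel. This is NOT AITER's real coverage.
TUNED_PATHS: Dict[Tuple[str, str], str] = {
    ("paged_mqa_logits", "gfx950"): "tuned",
    ("sparse_mla", "gfx950"): "tuned",
    ("dense_gemm_fp8", "gfx942"): "tuned",
    ("dense_gemm_fp8", "gfx950"): "tuned",
}


def plan_kernels(ops: List[str], arch: str) -> Dict[str, str]:
    """Pick 'tuned' where the table covers (op, arch), else the generic path."""
    return {op: TUNED_PATHS.get((op, arch), "generic_triton") for op in ops}


def fallbacks(ops: List[str], arch: str) -> List[str]:
    """List the ops that would silently take the slow generic path."""
    plan = plan_kernels(ops, arch)
    return [op for op, backend in plan.items() if backend != "tuned"]


if __name__ == "__main__":
    ops = ["paged_mqa_logits", "sparse_mla", "dense_gemm_fp8"]
    print(fallbacks(ops, "gfx942"))
```

The useful habit this suggests is to enumerate the fallbacks for your architecture up front, because a silent default to the generic path shows up only as a slow model, not as an error.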
The fallout from this has two different shapes. Some pieces are missing AITER paths entirely on gfx942: paged MQA logits, sparse MLA...