Unlocking Extreme AMD Instinct Inference with Software-Hardware Co-Optimization

ATOM: Unlocking Extreme AMD Instinct Inference with Software-Hardware Co-Optimization — ROCm Blogs

Ctrl+K

ROCm blogs

ATOM: Unlocking Extreme AMD Instinct Inference with Software-Hardware Co-Optimization

Contents

ATOM: Unlocking Extreme AMD Instinct Inference with Software-Hardware Co-Optimization#

June 15, 2026 by Lingpeng Jin, Carlus Huang, Hattie Wu, Chuan Li, Peng Sun, Barsoum Emad.

5 min read. | 1286 total words.

Software tools & optimizations

AI/ML, LLM, Serving, Optimization, Performance

Lingpeng Jin, Carlus Huang, Hattie Wu, Chuan Li, Peng Sun, Barsoum Emad

English

-->

As LLM serving enters a phase defined by high concurrency, long-context workloads, sparse MoE activation, and multi-GPU deployment, the challenge is no longer basic functionality but sustaining peak efficiency on AMD GPUs under production-scale load. ATOM (AiTer Optimized Model) is built for that goal, following four core principles: system-level optimization for LLM inference on AMD Instinct™ GPUs, kernel-level acceleration through AITER, distributed inference scaling with MORI, and a rollout-engine path for RL workloads. It builds on earlier ROCm blog coverage of AITER and vLLM-ATOM, moving from kernel and plugin acceleration into the standalone ATOM inference engine. Rather than being a generic framework adapted to the ROCm™ software, ATOM is an execution engine designed with ROCm-first priorities, AITER-native operators, and deep optimization on the inference-critical path. Aligned with the AMD Instinct roadmap from single-node optimization to multi-node scale-out, ATOM evolves its architecture, kernel strategy, and distributed execution model in lockstep with each hardware generation.

This blog covers six topics: ATOM’s software positioning in the AMD AI stack, the ATOM architecture, current feature scope, model coverage, benchmark dashboard usage, and practical takeaways.

By the end of this blog, you will have a practical view of where ATOM fits in the AMD AI software stack, what it supports today, and how to use ATOM recipes and dashboard data for deployment and tuning decisions.

Software Positioning in the AMD AI Stack#

To understand ATOM’s role clearly, it is useful to place it inside the AMD AI software stack from bottom to top:

ROCm (Foundation platform) : Open-source AMD accelerator software platform, including runtime, compiler, and core libraries such as HIP, RCCL, MIOpen, and rocBLAS.

AITER (Kernel acceleration layer) : High-performance kernel library for inference-critical operators, including Flash/Paged Attention, GEMM (FP8/MXFP4/INT8/INT4), Fused MoE, and norm/activation/position-encoding fusions.

MoRI (Communication and RDMA layer) : Modular RDMA and traffic-control stack optimized for HBM/XGMI/RDMA paths, with EP dispatch/combine and KV transfer support for distributed MoE serving.

ATOM (Inference engine layer) : The serving/runtime layer that exposes OpenAI-compatible APIs and coordinates scheduling, KV cache, torch.compile/HipGraph execution, TP/DP/EP parallelism, speculative decoding, and plugin integration.

This layering clarifies ATOM’s software positioning: ATOM is the system-level inference engine that orchestrates model execution end-to-end, while AITER and MoRI provide the underlying compute-kernel and communication acceleration paths that ATOM composes into production serving performance.

Architecture Overview: From API to GPU Execution#

ATOM currently supports two deployment modes:

Standalone ATOM serving mode

ATOM runs as an independent inference service stack and directly exposes OpenAI-compatible serving APIs.

Ecosystem-compatible deployment mode

ATOM integrates with the vLLM and SGLang ecosystem through compatible plugin paths, allowing users to adopt ATOM acceleration without rebuilding the full serving platform.

This blog focuses on the standalone serving mode. For ecosystem-compatible deployment, see the vLLM-ATOM blog.

ATOM follows a mainstream inference engine architecture pattern, but with stronger ROCm/AITER-oriented execution design. Figure 1 shows the software architecture used in standalone serving mode.

Figure 1. ATOM software architecture stack.

Serving Interfaces : Entry surface for sync, async, and streaming inference requests.

InputOutputProcessor : Tokenization/detokenization and TTFT/TPOT statistics.

LLMEngine : OpenAI-compatible serving engine entry and request handoff.

CoreManager + EngineCore : Multi-process orchestration and per-DP-rank runtime loop (intake -> schedule -> execute -> output) over ZMQ.

Scheduler + BlockManager + Parallelism Strategy : Prefill-first batching, KV block lifecycle/prefix cache, and TP/DP/EP policy application.

ModelRunner -> Modeling -> Model Ops : Execution chain for prepare/run/postprocess, forward/decode flow construction, and dispatch to optimized ops (attention, MoE, sampling, MTP, quantization kernels).

A typical request lifecycle:

The request enters LLMEngine, is...

Unlocking Extreme AMD Instinct Inference with Software-Hardware Co-Optimization

Related Articles

The Newest Instagram "Exploit" Is the Goofiest I've Seen

Apple WWDC 2026 Livestream

Claude Fable 5

US Government directive to suspend access to Fable 5 and Mythos 5

German ruling declares Google liable for false answers in AI Overviews