ATOM: Unlocking Extreme AMD Instinct Inference with Software-Hardware Co-Optimization — ROCm Blogs
Skip to main content
Back to top
Ctrl+K
ROCm blogs
ATOM: Unlocking Extreme AMD Instinct Inference with Software-Hardware Co-Optimization
Contents
ATOM: Unlocking Extreme AMD Instinct Inference with Software-Hardware Co-Optimization#
June 15, 2026 by Lingpeng Jin, Carlus Huang, Hattie Wu, Chuan Li, Peng Sun, Barsoum Emad.
5 min read. | 1286 total words.
Software tools & optimizations
AI/ML, LLM, Serving, Optimization, Performance
AI
Lingpeng Jin, Carlus Huang, Hattie Wu, Chuan Li, Peng Sun, Barsoum Emad
English
-->
As LLM serving enters a phase defined by high concurrency, long-context workloads, sparse MoE activation, and multi-GPU deployment, the challenge is no longer basic functionality but sustaining peak efficiency on AMD GPUs under production-scale load. ATOM (AiTer Optimized Model) is built for that goal, following four core principles: system-level optimization for LLM inference on AMD Instinct™ GPUs, kernel-level acceleration through AITER, distributed inference scaling with MORI, and a rollout-engine path for RL workloads. It builds on earlier ROCm blog coverage of AITER and vLLM-ATOM, moving from kernel and plugin acceleration into the standalone ATOM inference engine. Rather than being a generic framework adapted to the ROCm™ software, ATOM is an execution engine designed with ROCm-first priorities, AITER-native operators, and deep optimization on the inference-critical path. Aligned with the AMD Instinct roadmap from single-node optimization to multi-node scale-out, ATOM evolves its architecture, kernel strategy, and distributed execution model in lockstep with each hardware generation.
This blog covers six topics: ATOM’s software positioning in the AMD AI stack, the ATOM architecture, current feature scope, model coverage, benchmark dashboard usage, and practical takeaways.
By the end of this blog, you will have a practical view of where ATOM fits in the AMD AI software stack, what it supports today, and how to use ATOM recipes and dashboard data for deployment and tuning decisions.
Software Positioning in the AMD AI Stack#
To understand ATOM’s role clearly, it is useful to place it inside the AMD AI software stack from bottom to top:
ROCm (Foundation platform) : Open-source AMD accelerator software platform, including runtime, compiler, and core libraries such as HIP, RCCL, MIOpen, and rocBLAS.
AITER (Kernel acceleration layer) : High-performance kernel library for inference-critical operators, including Flash/Paged Attention, GEMM (FP8/MXFP4/INT8/INT4), Fused MoE, and norm/activation/position-encoding fusions.
MoRI (Communication and RDMA layer) : Modular RDMA and traffic-control stack optimized for HBM/XGMI/RDMA paths, with EP dispatch/combine and KV transfer support for distributed MoE serving.
ATOM (Inference engine layer) : The serving/runtime layer that exposes OpenAI-compatible APIs and coordinates scheduling, KV cache, torch.compile/HipGraph execution, TP/DP/EP parallelism, speculative decoding, and plugin integration.
This layering clarifies ATOM’s software positioning: ATOM is the system-level inference engine that orchestrates model execution end-to-end, while AITER and MoRI provide the underlying compute-kernel and communication acceleration paths that ATOM composes into production serving performance.
Architecture Overview: From API to GPU Execution#
ATOM currently supports two deployment modes:
Standalone ATOM serving mode
ATOM runs as an independent inference service stack and directly exposes OpenAI-compatible serving APIs.
Ecosystem-compatible deployment mode
ATOM integrates with the vLLM and SGLang ecosystem through compatible plugin paths, allowing users to adopt ATOM acceleration without rebuilding the full serving platform.
This blog focuses on the standalone serving mode. For ecosystem-compatible deployment, see the vLLM-ATOM blog.
ATOM follows a mainstream inference engine architecture pattern, but with stronger ROCm/AITER-oriented execution design. Figure 1 shows the software architecture used in standalone serving mode.
Figure 1. ATOM software architecture stack.
Serving Interfaces : Entry surface for sync, async, and streaming inference requests.
InputOutputProcessor : Tokenization/detokenization and TTFT/TPOT statistics.
LLMEngine : OpenAI-compatible serving engine entry and request handoff.
CoreManager + EngineCore : Multi-process orchestration and per-DP-rank runtime loop (intake -> schedule -> execute -> output) over ZMQ.
Scheduler + BlockManager + Parallelism Strategy : Prefill-first batching, KV block lifecycle/prefix cache, and TP/DP/EP policy application.
ModelRunner -> Modeling -> Model Ops : Execution chain for prepare/run/postprocess, forward/decode flow construction, and dispatch to optimized ops (attention, MoE, sampling, MTP, quantization kernels).
A typical request lifecycle:
The request enters LLMEngine, is...