[ROCm] Add AMD GPU support via HIP for tiny-vllm by jeffdaily · Pull Request #2 · jmaczan/tiny-vllm · GitHub
Open
[ROCm] Add AMD GPU support via HIP for tiny-vllm #2
jeffdaily wants to merge 3 commits into jmaczan:main from jeffdaily:moat-port
jeffdaily commented on Jun 17, 2026
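This adds AMD GPU support to tiny-vllm through ROCm/HIP while leaving the existing NVIDIA/CUDA build unchanged.

The CUDA kernels and host code are reused as-is. A new `src/cuda_to_hip.h` compatibility header keeps the CUDA spellings in the source and aliases them to their HIP equivalents (runtime calls, cuBLAS -> hipBLAS, and the `__nv_bfloat16` -> `__hip_bfloat16` type) when building for AMD. The only kernel-source change is the warp-shuffle mask: HIP requires a 64-bit lane mask for `__shfl_*_sync`, so the hardcoded `0xffffffff` becomes a `WARP_FULL_MASK` macro (`0xffffffffffffffffULL` on HIP, `0xffffffff` on CUDA). The paged-attention reduction is wave-size agnostic, so the same source runs correctly on wave64 (gfx90a) and wave32 (gfx1100, gfx1201).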
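To make the aliasing approach concrete, here is a minimal sketch of what a shim in the spirit of `src/cuda_to_hip.h` could look like. It is not the header from this PR. The `USE_HIP` preprocessor guard, the particular set of aliased calls, and the include paths are assumptions chosen for illustration. Only the two `WARP_FULL_MASK` values and the bf16 type alias come from the description above.

```cpp
// Illustrative shim only; the real src/cuda_to_hip.h may differ.
// Assumption: the build passes USE_HIP to the compiler when targeting AMD.
#if defined(USE_HIP)
#include <hip/hip_runtime.h>
#include <hip/hip_bf16.h>
#include <hipblas/hipblas.h>

// Keep the CUDA spellings in the sources and map them to HIP equivalents.
#define cudaError_t             hipError_t
#define cudaSuccess             hipSuccess
#define cudaMalloc              hipMalloc
#define cudaFree                hipFree
#define cudaMemcpy              hipMemcpy
#define cudaMemcpyHostToDevice  hipMemcpyHostToDevice
#define cudaDeviceSynchronize   hipDeviceSynchronize

// cuBLAS -> hipBLAS (only two representative names shown).
#define cublasHandle_t          hipblasHandle_t
#define cublasCreate            hipblasCreate

// bf16 storage type.
#define __nv_bfloat16           __hip_bfloat16

// HIP's __shfl_*_sync takes a 64-bit lane mask.
#define WARP_FULL_MASK 0xffffffffffffffffULL
#else
// CUDA build: the sources include the CUDA headers themselves,
// so only the 32-bit mask needs to be defined here.
#define WARP_FULL_MASK 0xffffffff
#endif
```

With a shim like this, a call site written as `__shfl_down_sync(WARP_FULL_MASK, val, offset)` would compile unchanged for both backends, which is the property the single kernel-source change relies on.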
`CMakeLists.txt` gains a `USE_HIP` option (default OFF). When OFF, the build is the existing CUDA configuration, unchanged. When ON, it enables the HIP language, compiles the sources with hipcc, and links hipBLAS. The GPU architecture is selected by the caller via `CMAKE_HIP_ARCHITECTURES` (it is not hardcoded):
```
cmake -B build -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx1100 -G Ninja
cmake --build build
```
The README's setup section documents the AMD build path alongside the existing NVIDIA instructions.
## Validation
Built and exercised on real AMD GPUs -- gfx90a (MI250X), gfx1100 (Radeon Pro W7800), and gfx1201 (RX 9070 XT). On each, the HIP runtime, the bf16 embedding-gather kernel, the 64-bit-mask warp-shuffle reduction at 64 threads/block, and the hipBLAS bf16 GEMM all pass (the 64-bit mask fix confirmed on both wave64 and wave32).
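As a rough illustration of what "wave-size agnostic" can mean for a shuffle-based reduction, the following host-side model runs the usual shuffle-down tree with the starting offset derived from the warp width instead of a hardcoded 16. It is plain C++ that simulates the lanes of one warp. It is not the PR's kernel, and the description does not state whether the paged-attention code is structured this way. The names `shfl_down` and `warp_reduce_sum` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Host-side model of a shuffle-down: lane i reads the value held by lane
// i + delta, or keeps its own value when i + delta is outside the warp.
std::vector<float> shfl_down(const std::vector<float>& lanes, std::size_t delta) {
    std::vector<float> out(lanes);
    for (std::size_t i = 0; i + delta < lanes.size(); ++i) {
        out[i] = lanes[i + delta];
    }
    return out;
}

// Tree reduction over one simulated warp. The first offset is half the warp
// width, so the same loop covers 32 and 64 lanes. Assumes a non-empty input
// whose size is a power of two. Lane 0 ends up holding the full sum.
float warp_reduce_sum(std::vector<float> lanes) {
    for (std::size_t offset = lanes.size() / 2; offset > 0; offset /= 2) {
        const std::vector<float> shifted = shfl_down(lanes, offset);
        for (std::size_t i = 0; i < lanes.size(); ++i) {
            lanes[i] += shifted[i];
        }
    }
    return lanes[0];
}
```

Feeding this model 32 lanes or 64 lanes gives the sum of all inputs in lane 0 either way. On a device, the analogous loop would take its starting offset from the runtime warp size and pass `WARP_FULL_MASK` to the shuffle.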
On gfx1100, full end-to-end inference was additionally validated: loading Llama 3.2 1B Instruct weights and running prefill+decode produces coherent, correct output (for example "What is 2+2?" -> 4 and "Capital of France?" -> Paris), exercising the complete path (embedding, 16 transformer layers with hipBLAS GEMMs, paged attention, SwiGLU MLP, lm_head). The CUDA build path is unchanged.
Authored with the assistance of Claude.