AMD contributes their GPU support to tiny-vLLM


[ROCm] Add AMD GPU support via HIP for tiny-vllm by jeffdaily · Pull Request #2 · jmaczan/tiny-vllm · GitHub


Open: jeffdaily wants to merge 3 commits into jmaczan:main from jeffdaily:moat-port

jeffdaily commented Jun 17, 2026

This adds AMD GPU support to tiny-vllm through ROCm/HIP while leaving the existing NVIDIA/CUDA build unchanged.

The CUDA kernels and host code are reused as-is. A new `src/cuda_to_hip.h` compatibility header keeps the CUDA spellings in the source and aliases them to their HIP equivalents (runtime calls, cuBLAS -> hipBLAS, and the `__nv_bfloat16` -> `__hip_bfloat16` type) when building for AMD. The only kernel-source change is the warp-shuffle mask: HIP requires a 64-bit lane mask for `__shfl_*_sync`, so the hardcoded `0xffffffff` becomes a `WARP_FULL_MASK` macro (`0xffffffffffffffffULL` on HIP, `0xffffffff` on CUDA). The paged-attention reduction is wave-size agnostic, so the same source runs correctly on wave64 (gfx90a) and wave32 (gfx1100, gfx1201).

`CMakeLists.txt` gains a `USE_HIP` option (default OFF). When OFF, the build is the existing CUDA configuration, unchanged. When ON, it enables the HIP language, compiles the sources with hipcc, and links hipBLAS. The GPU architecture is selected by the caller via `CMAKE_HIP_ARCHITECTURES` (it is not hardcoded):

```
cmake -B build -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx1100 -G Ninja
cmake --build build
```

The README's setup section documents the AMD build path alongside the existing NVIDIA instructions.

## Validation

Built and exercised on real AMD GPUs: gfx90a (MI250X), gfx1100 (Radeon Pro W7800), and gfx1201 (RX 9070 XT). On each, the HIP runtime, the bf16 embedding-gather kernel, the 64-bit-mask warp-shuffle reduction at 64 threads/block, and the hipBLAS bf16 GEMM all pass (the 64-bit mask fix confirmed on both wave64 and wave32).

On gfx1100, full end-to-end inference was additionally validated: loading Llama 3.2 1B Instruct weights and running prefill+decode produces coherent, correct output (for example "What is 2+2?" -> 4 and "Capital of France?" -> Paris), exercising the complete path (embedding, 16 transformer layers with hipBLAS GEMMs, paged attention, SwiGLU MLP, lm_head). The CUDA build path is unchanged.

Authored with the assistance of Claude.
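For readers who want a concrete picture of the warp-shuffle mask change and of what "wave-size agnostic" means, here is a minimal sketch. It is not the code from this PR: the `__HIP_PLATFORM_AMD__` guard is an assumption about how the macro might be selected (the actual header may key off a different symbol), and `emulated_wave_sum` is a hypothetical host-side stand-in for a `__shfl_down_sync` tree reduction, written in plain C++ so it can run without a GPU.

```cpp
#include <cstddef>
#include <vector>

// Assumption: hipcc defines __HIP_PLATFORM_AMD__ when targeting AMD GPUs;
// the header in the PR may select the mask with a different symbol.
#if defined(__HIP_PLATFORM_AMD__)
#define WARP_FULL_MASK 0xffffffffffffffffULL  // 64-bit lane mask on HIP
#else
#define WARP_FULL_MASK 0xffffffff             // 32-bit lane mask on CUDA
#endif

// Hypothetical host-side emulation of a shuffle-down tree reduction.
// `lanes` holds one value per lane; its size plays the role of the wave
// size and is expected to be a power of two (32 or 64 here).
inline float emulated_wave_sum(std::vector<float> lanes) {
    if (lanes.empty()) {
        return 0.0f;
    }
    const std::size_t width = lanes.size();
    // Starting at width / 2 instead of a hardcoded 16 is what keeps the
    // loop independent of the wave size.
    for (std::size_t offset = width / 2; offset > 0; offset /= 2) {
        // Ascending order reads lanes[lane + offset] before that slot is
        // updated in this round, mirroring the simultaneous exchange a
        // hardware shuffle performs.
        for (std::size_t lane = 0; lane + offset < width; ++lane) {
            lanes[lane] += lanes[lane + offset];
        }
    }
    return lanes[0];  // lane 0 ends up holding the sum of all lanes
}
```

Calling `emulated_wave_sum` with 32 or with 64 lanes goes through the same loop and leaves the total in lane 0 either way, which is the property the description relies on when it says the same source runs on wave32 and wave64.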
