GPU Cluster Deployment and Operations Guide - AI Infrastructure Knowledge Base
AI infrastructure knowledge base
Scope: the landing page for this knowledge base, what it covers and where to start. Reference/landing page, not a single implementation topic.
A practical, citable knowledge base for deploying, operating, and optimising GPU clusters, from the physical datacentre and the InfiniBand fabric up through Kubernetes, Slurm and Ray, distributed training and reinforcement-learning post-training, and LLM inference serving at scale. It covers the full NVIDIA range: Ampere, Hopper, and Blackwell datacenter GPUs, RTX consumer and workstation cards, and DGX systems (including DGX Spark), along with their operational, installation, and networking differences. The Blackwell Ultra generation (B300 / GB300 NVL72) is the current focus. Current to mid-2026.
It is written for the people who keep accelerators busy: systems administrators, GPU server engineers, platform engineers, SREs, and MLOps engineers. Every page follows a predictable shape, carries reference examples (Ansible, Helm/Kubernetes, Slurm, PyTorch, vLLM), and includes architecture diagrams and links to the primary papers and documentation.
This knowledge base (ai-infrastructure.net) is built and maintained by setloop.io.
flowchart LR
    HW["GPU hardware"] --> BUILD["Build and commission"]
    BUILD --> PLATFORM["Cluster platform"]
    PLATFORM --> TRAIN["Training and post-training"]
    PLATFORM --> SERVE["Inference serving"]
    TRAIN --> OPS["Operate and optimise"]
    SERVE --> OPS

Open the reading paths · Browse the glossary
What's inside
GPU hardware
The full NVIDIA range: Ampere, Hopper, and Blackwell datacenter GPUs; RTX consumer and workstation cards; DGX systems and DGX Spark; and how operating each of them differs.
GPU generations · RTX & workstation
Build & commission
Bill-of-materials validation, datacentre power and cooling, the HPC networking fabric, and commissioning to acceptance.
Networking fabric · Blackwell platform
Cluster technologies
Kubernetes, k3s, Ray, and Slurm. Each page covers what the technology is, why and when to use it, and how to use, develop, scale, serve, and fine-tune with it.
Orchestration overview
Training & post-training
FSDP, DDP, ZeRO, tensor and pipeline parallelism, DiLoCo; SFT/LoRA, DPO and GRPO; verl, slime, SkyRL and more.
Fine-tuning & RL · RL libraries
Inference serving
Serving current open-weight models (Kimi K2, GLM, DeepSeek, Qwen), with continuous batching, KV cache management, and disaggregated prefill/decode.
Inference serving · Disaggregation
Operate & optimise
Observability, RAS and XID failure modes, NCCL and hardware tuning, SLOs/SLIs, and error-budget alerting.
Observability · Reliability & RAS
Recipes & runbooks
Ansible playbooks, Helm/Kubernetes manifests, telemetry stacks, and step-by-step operational runbooks for recurring incidents.
Recipes & manifests · Runbooks
How to use this knowledge base
Concept pages explain a topic and its traps: overview, core knowledge, a don't-miss checklist, failure modes, and references.
Recipe and runbook pages are example-first: copy-paste manifests, playbooks, and step-by-step procedures with the commands to apply and verify them.
Per-technology pages give each cluster technology, training algorithm, RL library, and runbook its own page following a fixed shape: what it is, why and when to use it, how to use, develop, scale, serve for inference, fine-tune, and run on optimised hardware, plus a cookbook of common use cases.
Suggested starting points
New here? Read the knowledge base index for the full map and reading paths.
Standing up a cluster? Ansible bring-up → Kubernetes & Helm platform → telemetry.
Serving a model? Serve open-weight models → SLO/SLI catalog.
Fine-tuning? SFT & LoRA → GRPO → RL libraries.
References
NVIDIA Blackwell architecture: https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/
NVIDIA DGX SuperPOD reference architecture:...