Show HN: AI Infrastructure Knowledge Base

hevalon1 pts0 comments

GPU Cluster Deployment and Operations Guide - AI Infrastructure Knowledge Base

AI infrastructure knowledge base

Scope: this is the landing page for the knowledge base, describing what it covers and where to start. It is a reference page, not a single implementation topic.

A practical, citable knowledge base for deploying, operating, and optimising GPU clusters, from the physical datacentre and the InfiniBand fabric up through Kubernetes, Slurm and Ray, distributed training and reinforcement-learning post-training, and LLM inference serving at scale. It covers the full NVIDIA range: Ampere, Hopper, and Blackwell datacenter GPUs, RTX consumer and workstation cards, and DGX systems (including DGX Spark), with their operational, install, and networking differences, and with the Blackwell Ultra (B300 / GB300 NVL72) generation as the current focus. Current to mid-2026.

It is written for the people who keep accelerators busy: systems administrators, GPU server engineers, platform engineers, SREs, and MLOps engineers. Every page follows a predictable shape, carries reference examples (Ansible, Helm/Kubernetes, Slurm, PyTorch, vLLM), and includes architecture diagrams and links to the primary papers and documentation.

This knowledge base (ai-infrastructure.net) is built and maintained by setloop.io.

flowchart LR
    HW["GPU hardware"] --> BUILD["Build and commission"]
    BUILD --> PLATFORM["Cluster platform"]
    PLATFORM --> TRAIN["Training and post-training"]
    PLATFORM --> SERVE["Inference serving"]
    TRAIN --> OPS["Operate and optimise"]
    SERVE --> OPS

Open the reading paths · Browse the glossary
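The flowchart above can also be read as a dependency graph between the sections of the knowledge base. As a minimal sketch (not part of the site itself), the Python below copies the edges from the diagram and derives one valid reading order with the standard-library `graphlib` module. The names `PREREQUISITES` and `reading_order` are hypothetical, chosen only for this illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Edges copied from the flowchart: each stage maps to the stages it depends on.
# The dictionary name and layout are illustrative assumptions.
PREREQUISITES = {
    "Build and commission": {"GPU hardware"},
    "Cluster platform": {"Build and commission"},
    "Training and post-training": {"Cluster platform"},
    "Inference serving": {"Cluster platform"},
    "Operate and optimise": {"Training and post-training", "Inference serving"},
}


def reading_order(prerequisites):
    """Return one order in which every stage comes after its prerequisites."""
    return list(TopologicalSorter(prerequisites).static_order())


if __name__ == "__main__":
    for step, stage in enumerate(reading_order(PREREQUISITES), start=1):
        print(f"{step}. {stage}")
```

Training and inference serving both depend only on the cluster platform, so either can be read first; the sort guarantees only that both precede "Operate and optimise".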

What's inside

GPU hardware

The full NVIDIA range: Ampere, Hopper, Blackwell datacenter GPUs; RTX consumer and workstation cards; DGX systems and DGX Spark, and how their ops differ.

GPU generations · RTX & workstation

Build & commission

Bill-of-materials validation, datacentre power and cooling, the HPC networking fabric, and commissioning to acceptance.

Networking fabric · Blackwell platform

Cluster technologies

Kubernetes, k3s, Ray and Slurm, each covered in the same order: what it is, why and when to use it, and how to use, develop, scale, serve, and fine-tune with it.

Orchestration overview

Training & post-training

FSDP, DDP, ZeRO, tensor and pipeline parallelism, DiLoCo; SFT/LoRA, DPO and GRPO; verl, slime, SkyRL and more.

Fine-tuning & RL · RL libraries

Inference serving

Serving the latest open-weight models (Kimi K2, GLM, DeepSeek, Qwen), continuous batching, KV cache, and disaggregated prefill/decode.

Inference serving · Disaggregation

Operate & optimise

Observability, RAS and XID failure modes, NCCL and hardware tuning, SLOs/SLIs, and error-budget alerting.

Observability · Reliability & RAS

Recipes & runbooks

Ansible playbooks, Helm/Kubernetes manifests, telemetry stacks, and step-by-step operational runbooks for recurring incidents.

Recipes & manifests · Runbooks

How to use this knowledge base

Concept pages explain a topic and its traps: overview, core knowledge, a don't-miss checklist, failure modes, and references.

Recipe and runbook pages are example-first: copy-paste manifests, playbooks, and step-by-step procedures with the commands to apply and verify them.

Per-technology pages give each cluster technology, training algorithm, RL library, and runbook its own page following a fixed shape: what it is, why and when to use it, how to use, develop, scale, serve for inference, fine-tune, and run on optimised hardware, plus a cookbook of common use cases.

Suggested starting points

New here? Read the knowledge base index for the full map and reading paths.

Standing up a cluster? Ansible bring-up → Kubernetes & Helm platform → telemetry.

Serving a model? Serve open-weight models → SLO/SLI catalog.

Fine-tuning? SFT & LoRA → GRPO → RL libraries.

References

NVIDIA Blackwell architecture: https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/

NVIDIA DGX SuperPOD reference architecture:...
