Show HN: AI Infrastructure Knowledge Base

GPU Cluster Deployment and Operations Guide - AI Infrastructure Knowledge Base

AI infrastructure knowledge base

Scope: the landing page for this knowledge base, what it covers and where to start. Reference/landing page, not a single implementation topic.

A practical, citable knowledge base for deploying, operating, and optimising GPU clusters, from the physical datacentre and the InfiniBand fabric up through Kubernetes, Slurm and Ray, distributed training and reinforcement-learning post-training, and LLM inference serving at scale. It covers the full NVIDIA range: Ampere, Hopper, and Blackwell datacenter GPUs, RTX consumer and workstation cards, and DGX systems (including DGX Spark), with their operational, install, and networking differences, and with the Blackwell Ultra (B300 / GB300 NVL72) generation as the current focus. Current to mid-2026.

It is written for the people who keep accelerators busy: systems administrators, GPU server engineers, platform engineers, SREs, and MLOps engineers. Every page follows a predictable shape, carries reference examples (Ansible, Helm/Kubernetes, Slurm, PyTorch, vLLM), and includes architecture diagrams and links to the primary papers and documentation.

This knowledge base (ai-infrastructure.net) is built and maintained by setloop.io.

flowchart LR
    HW["GPU hardware"] --> BUILD["Build and commission"]
    BUILD --> PLATFORM["Cluster platform"]
    PLATFORM --> TRAIN["Training and post-training"]
    PLATFORM --> SERVE["Inference serving"]
    TRAIN --> OPS["Operate and optimise"]
    SERVE --> OPS

Open the reading paths · Browse the glossary

What's inside

GPU hardware

The full NVIDIA range: Ampere, Hopper, Blackwell datacenter GPUs; RTX consumer and workstation cards; DGX systems and DGX Spark, and how their ops differ.

GPU generations · RTX & workstation

Build & commission

Bill-of-materials validation, datacentre power and cooling, the HPC networking fabric, and commissioning to acceptance.

Networking fabric · Blackwell platform

Cluster technologies

Kubernetes, k3s, Ray and Slurm, each covered in the same order: what it is, why and when to use it, and how to use, develop, scale, serve, and fine-tune with it.

Orchestration overview

Training & post-training

FSDP, DDP, ZeRO, tensor and pipeline parallelism, DiLoCo; SFT/LoRA, DPO and GRPO; verl, slime, SkyRL and more.

Fine-tuning & RL · RL libraries

Inference serving

Serving the latest open-weight models (Kimi K2, GLM, DeepSeek, Qwen), continuous batching, KV cache, and disaggregated prefill/decode.

Inference serving · Disaggregation

Operate & optimise

Observability, RAS and XID failure modes, NCCL and hardware tuning, SLOs/SLIs, and error-budget alerting.

Observability · Reliability & RAS

Recipes & runbooks

Ansible playbooks, Helm/Kubernetes manifests, telemetry stacks, and step-by-step operational runbooks for the recurring incidents.

Recipes & manifests · Runbooks
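
The error-budget alerting mentioned under "Operate & optimise" rests on a simple ratio, sketched below in Python. This is an illustration only and is not taken from the knowledge base itself: the function name `burn_rate`, its parameters, and the sample numbers are all hypothetical. The one assumption it relies on is the usual definition of burn rate as the observed error rate divided by the error budget (1 minus the SLO target).

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Return how fast the error budget is being consumed in a window.

    A value of 1.0 means the budget would be used up exactly at the end
    of the SLO period; values above 1.0 mean it would run out early.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        return 0.0  # no traffic in the window, so nothing was burned
    error_rate = bad_events / total_events
    error_budget = 1.0 - slo_target  # allowed fraction of bad events
    return error_rate / error_budget


# Hypothetical window: 50 failed requests out of 10,000 against a 99.9% SLO.
rate = burn_rate(bad_events=50, total_events=10_000, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")
```

In practice an alert would compare this value against a threshold chosen per time window; the thresholds themselves are a policy decision and are left out of the sketch.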

How to use this knowledge base

Concept pages explain a topic and its traps: overview, core knowledge, a don't-miss checklist, failure modes, and references.

Recipe and runbook pages are example-first: copy-paste manifests, playbooks, and step-by-step procedures with the commands to apply and verify them.

Per-technology pages give each cluster technology, training algorithm, RL library, and runbook its own page following a fixed shape: what it is, why and when to use it, how to use, develop, scale, serve for inference, fine-tune, and run on optimised hardware, plus a cookbook of common use cases.

Suggested starting points

New here? Read the knowledge base index for the full map and reading paths.

Standing up a cluster? Ansible bring-up → Kubernetes & Helm platform → telemetry.

Serving a model? Serve open-weight models → SLO/SLI catalog.

Fine-tuning? SFT & LoRA → GRPO → RL libraries.

References

NVIDIA Blackwell architecture: https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/

NVIDIA DGX SuperPOD reference architecture:...
