AWSome Distributed AI | Day1Training
Open Source • MIT-0 License
AWSome<br>Distributed AI
Reference architectures, test cases, and best practices for training large-scale models with PyTorch, Megatron-LM, NeMo, JAX, and more on AWS infrastructure.
30+<br>Test Cases
10<br>Architectures
Frameworks
1.5K<br>Commits
Training Frameworks<br>Production-ready examples grouped by framework. Each includes Dockerfiles, Slurm scripts, and Kubernetes manifests.
🔥 PyTorch<br>Native distributed training with DDP, FSDP, TorchTitan, DeepSpeed, and more. Covers LLMs, vision, robotics, and RLHF.<br>FSDP · DDP · DeepSpeed · TorchTitan · Picotron · vLLM · TRL · OpenRLHF
Megatron<br>NVIDIA Megatron-LM and NeMo for large-scale LLM pre-training with tensor, pipeline, and expert parallelism.<br>Megatron-LM · NeMo · NeMo RL · BioNeMo
🧬 JAX<br>Google JAX with PaxML for distributed training leveraging XLA compilation and automatic parallelism.<br>PaxML · XLA · TPU/GPU
🧠 AWS Neuron / Trainium<br>NeuronX Distributed for training on AWS Trainium & Inferentia chips with optimized compilers.<br>NeuronX · Optimum Neuron · Trainium
🤖 Physical AI & Robotics<br>Embodied AI training with NVIDIA Isaac Lab, OpenVLA, V-JEPA 2, and vision-language-action models.<br>Isaac Lab · OpenVLA · V-JEPA 2 · nanoVLM
🎯 Reinforcement Learning<br>RLHF, DPO, PPO, and scalable RL frameworks for LLM alignment and post-training.<br>TRL · vERL · SLIME · PPO · DPO
🧪 Model Customisation<br>Knowledge distillation, compression, and model adaptation techniques for production.<br>Distillation · Compression · Transfer Learning
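The Megatron card mentions tensor and pipeline parallelism. As a rough illustration of how those degrees relate to cluster size, the sketch below computes how many data-parallel replicas remain once tensor and pipeline parallelism are chosen. The function name and the 32-GPU example are hypothetical and are not taken from this repository; expert and context parallelism are deliberately left out to keep the arithmetic simple.

```python
def data_parallel_size(world_size: int, tensor_parallel: int, pipeline_parallel: int) -> int:
    """Return the number of data-parallel replicas for a given GPU count.

    Assumes every model replica occupies tensor_parallel * pipeline_parallel GPUs,
    so the total GPU count must be a multiple of that product.
    """
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"tensor_parallel * pipeline_parallel = {model_parallel}"
        )
    return world_size // model_parallel


# Hypothetical cluster: 4 nodes with 8 GPUs each (32 GPUs in total).
replicas = data_parallel_size(32, tensor_parallel=4, pipeline_parallel=2)
print(replicas)  # 4
```

With these assumed settings each model replica spans 8 GPUs, leaving 4 replicas to split the global batch.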
Reference Architectures<br>CloudFormation templates and deployment guides for every AWS compute platform.<br>🖥️ SageMaker HyperPod<br>Managed GPU clusters
🔧 ParallelCluster<br>HPC with Slurm
☸️ Amazon EKS<br>Kubernetes orchestration
📦 AWS Batch<br>Serverless batch
🌐 VPC & Networking<br>EFA-optimized
💾 Storage (S3/FSx)<br>High-throughput
⚡ HyperPod + EKS<br>Managed HPC on K8s
🧮 AWS PCS<br>Parallel Computing Service
Get Started in Minutes<br>Three steps to launch your first distributed training job.<br>Deploy Infrastructure<br>Launch a cluster using our CloudFormation templates for HyperPod, ParallelCluster, or EKS.
Build Container<br>Use our Dockerfiles to build a training container with your framework of choice.
Launch Training<br>Submit your job with Slurm or Kubernetes using our ready-made launch scripts.
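To make the launch step more concrete, here is a minimal sketch of how a multi-node PyTorch launch command might be assembled with `torchrun`. It is not one of the project's ready-made launch scripts: the helper name, the host name `node-0`, the script name `train.py`, and the job id are all hypothetical placeholders. Only the `torchrun` flags themselves (`--nnodes`, `--nproc_per_node`, `--rdzv_backend`, `--rdzv_endpoint`, `--rdzv_id`) are standard PyTorch options.

```python
import shlex


def build_torchrun_command(nnodes, gpus_per_node, head_host, script,
                           port=29500, job_id="demo"):
    """Build the argument list for a multi-node torchrun launch.

    head_host is the node that hosts the c10d rendezvous; every node in the
    job would run the same command (for example via `srun` under Slurm).
    """
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={gpus_per_node}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={head_host}:{port}",
        f"--rdzv_id={job_id}",
        script,
    ]


# Hypothetical two-node job with 8 GPUs per node.
cmd = build_torchrun_command(2, 8, "node-0", "train.py")
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps it safe to pass to `subprocess.run` without shell quoting issues.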