Distributed AI on AWS

AWSome Distributed AI | Day1Training

Open Source • MIT-0 License

AWSome Distributed AI

Reference architectures, test cases, and best practices for training large-scale models with PyTorch, Megatron-LM, NeMo, JAX, and more on AWS infrastructure.

30+ Test Cases

10 Architectures

Frameworks

1.5K Commits

Training Frameworks

Production-ready examples grouped by framework. Each includes Dockerfiles, Slurm scripts, and Kubernetes manifests.

🔥 PyTorch
Native distributed training with DDP, FSDP, TorchTitan, DeepSpeed, and more. Covers LLMs, vision, robotics, and RLHF.
FSDP • DDP • DeepSpeed • TorchTitan • Picotron • vLLM • TRL • OpenRLHF

Megatron
NVIDIA Megatron-LM and NeMo for large-scale LLM pre-training with tensor, pipeline, and expert parallelism.
Megatron-LM • NeMo • NeMo RL • BioNeMo

🧬 JAX
Google JAX with PaxML for distributed training leveraging XLA compilation and automatic parallelism.
PaxML • XLA • TPU/GPU

🧠 AWS Neuron / Trainium
NeuronX Distributed for training on AWS Trainium & Inferentia chips with optimized compilers.
NeuronX • Optimum Neuron • Trainium

🤖 Physical AI & Robotics
Embodied AI training with NVIDIA Isaac Lab, OpenVLA, V-JEPA 2, and vision-language-action models.
Isaac Lab • OpenVLA • V-JEPA 2 • nanoVLM

🎯 Reinforcement Learning
RLHF, DPO, PPO, and scalable RL frameworks for LLM alignment and post-training.
TRL • vERL • SLIME • PPO • DPO

🧪 Model Customisation
Knowledge distillation, compression, and model adaptation techniques for production.
Distillation • Compression • Transfer Learning
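The Megatron entry above mentions tensor and pipeline parallelism. As a rough illustration of how those degrees interact with cluster size, here is a minimal Python sketch of the usual sizing arithmetic: each model replica spans `tensor_parallel × pipeline_parallel` GPUs, and the remaining factor of the world size is the data-parallel degree. The function name and the example job shape are hypothetical and not taken from this repository; the sketch also assumes no context parallelism and leaves out expert parallelism entirely.

```python
def data_parallel_size(world_size: int, tensor_parallel: int, pipeline_parallel: int) -> int:
    """Return how many model replicas fit when each replica spans TP x PP GPUs.

    Hypothetical helper for illustration only. Assumes context parallelism
    is 1 and ignores expert parallelism.
    """
    if min(world_size, tensor_parallel, pipeline_parallel) < 1:
        raise ValueError("all sizes must be positive integers")
    gpus_per_replica = tensor_parallel * pipeline_parallel
    if world_size % gpus_per_replica != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"TP x PP = {gpus_per_replica}"
        )
    return world_size // gpus_per_replica


if __name__ == "__main__":
    # Hypothetical job: 8 nodes x 8 GPUs, TP=8 inside a node, PP=2 across nodes.
    print(data_parallel_size(world_size=64, tensor_parallel=8, pipeline_parallel=2))  # prints 4
```

A check like this is cheap to run before submitting a job, since a world size that does not divide evenly by the model-parallel degrees would otherwise only surface as a failure at launch time.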

Reference Architectures

CloudFormation templates and deployment guides for every AWS compute platform.

🖥️ SageMaker HyperPod Managed GPU clusters

🔧 ParallelCluster HPC with Slurm

☸️ Amazon EKS Kubernetes orchestration

📦 AWS Batch Serverless batch

🌐 VPC & Networking EFA-optimized

💾 Storage (S3/FSx) High-throughput

⚡ HyperPod + EKS Managed HPC on K8s

🧮 AWS PCS Parallel Computing Service

Get Started in Minutes

Three steps to launch your first distributed training job.

1. Deploy Infrastructure: Launch a cluster using our CloudFormation templates for HyperPod, ParallelCluster, or EKS.

2. Build Container: Use our Dockerfiles to build a training container with your framework of choice.

3. Launch Training: Submit your job with Slurm or Kubernetes using our ready-made launch scripts.
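To give a feel for what the launch step involves on a multi-node PyTorch job, the following Python sketch assembles a `torchrun` command line using its rendezvous flags (`--nnodes`, `--nproc_per_node`, `--rdzv_id`, `--rdzv_backend`, `--rdzv_endpoint`). This is not one of the repository's launch scripts: the helper name, the node name `node-001`, the port, and `train.py` are all assumptions made for illustration.

```python
import shlex
from typing import List, Sequence


def build_torchrun_command(
    head_node: str,
    nnodes: int,
    gpus_per_node: int,
    job_id: str,
    script: str,
    script_args: Sequence[str] = (),
    port: int = 29500,
) -> List[str]:
    """Build a torchrun argv list for a multi-node job (hypothetical helper)."""
    if nnodes < 1 or gpus_per_node < 1:
        raise ValueError("nnodes and gpus_per_node must be positive")
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={gpus_per_node}",
        f"--rdzv_id={job_id}",                  # shared by every node in the job
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={head_node}:{port}",  # where all nodes rendezvous
        script,
        *script_args,
    ]


if __name__ == "__main__":
    # Hypothetical 2-node, 8-GPU-per-node job; all values are placeholders.
    cmd = build_torchrun_command(
        head_node="node-001",
        nnodes=2,
        gpus_per_node=8,
        job_id="12345",
        script="train.py",
        script_args=["--epochs", "1"],
    )
    print(shlex.join(cmd))
```

Under Slurm, a command like this is typically run once per node (for example through `srun`), with the head node taken from the first hostname in the job's node list and the job ID reused as the rendezvous ID. The actual scripts in the repository may take a different approach.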
