Distributed AI on AWS


AWSome Distributed AI | Day1Training

Open Source • MIT-0 License

AWSome Distributed AI

Reference architectures, test cases, and best practices for training large-scale models with PyTorch, Megatron-LM, NeMo, JAX, and more on AWS infrastructure.


30+ Test Cases

10 Architectures

Frameworks

1.5K Commits

Training Frameworks
Production-ready examples grouped by framework. Each includes Dockerfiles, Slurm scripts, and Kubernetes manifests.

🔥 PyTorch
Native distributed training with DDP, FSDP, TorchTitan, DeepSpeed, and more. Covers LLMs, vision, robotics, and RLHF.
FSDP, DDP, DeepSpeed, TorchTitan, Picotron, vLLM, TRL, OpenRLHF

Megatron
NVIDIA Megatron-LM and NeMo for large-scale LLM pre-training with tensor, pipeline, and expert parallelism.
Megatron-LM, NeMo, NeMo RL, BioNeMo

🧬 JAX
Google JAX with PaxML for distributed training leveraging XLA compilation and automatic parallelism.
PaxML, XLA, TPU/GPU

🧠 AWS Neuron / Trainium
NeuronX Distributed for training on AWS Trainium & Inferentia chips with optimized compilers.
NeuronX, Optimum Neuron, Trainium

🤖 Physical AI & Robotics
Embodied AI training with NVIDIA Isaac Lab, OpenVLA, V-JEPA 2, and vision-language-action models.
Isaac Lab, OpenVLA, V-JEPA 2, nanoVLM

🎯 Reinforcement Learning
RLHF, DPO, PPO, and scalable RL frameworks for LLM alignment and post-training.
TRL, vERL, SLIME, PPO, DPO

🧪 Model Customisation
Knowledge distillation, compression, and model adaptation techniques for production.
Distillation, Compression, Transfer Learning
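The Megatron entry above mentions tensor and pipeline parallelism. As a rough illustration of how those degrees constrain cluster sizing, the sketch below computes the data-parallel size left over once a model is split across tensor-parallel and pipeline-parallel ranks. It assumes the common convention that the total GPU count equals tensor-parallel size × pipeline-parallel size × data-parallel size, and it ignores expert and context parallelism for simplicity. The function name `data_parallel_size` is hypothetical and is not taken from the repository.

```python
def data_parallel_size(world_size: int, tensor_parallel: int, pipeline_parallel: int) -> int:
    """Return how many model replicas fit on `world_size` GPUs.

    Assumes world_size = tensor_parallel * pipeline_parallel * data_parallel
    (expert and context parallelism are deliberately left out of this sketch).
    """
    if min(world_size, tensor_parallel, pipeline_parallel) < 1:
        raise ValueError("all sizes must be positive integers")

    # GPUs needed to hold one full copy of the model.
    gpus_per_replica = tensor_parallel * pipeline_parallel

    if world_size % gpus_per_replica != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"tensor_parallel * pipeline_parallel = {gpus_per_replica}"
        )
    return world_size // gpus_per_replica


if __name__ == "__main__":
    # Example: 4 nodes with 8 GPUs each, model split 4-way tensor and 2-way pipeline.
    print(data_parallel_size(world_size=32, tensor_parallel=4, pipeline_parallel=2))
```

With 32 GPUs, a 4-way tensor split, and a 2-way pipeline split, each replica occupies 8 GPUs, so 4 replicas train in parallel.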

Reference Architectures
CloudFormation templates and deployment guides for every AWS compute platform.

🖥️ SageMaker HyperPod: Managed GPU clusters

🔧 ParallelCluster: HPC with Slurm

☸️ Amazon EKS: Kubernetes orchestration

📦 AWS Batch: Serverless batch

🌐 VPC & Networking: EFA-optimized

💾 Storage (S3/FSx): High-throughput

⚡ HyperPod + EKS: Managed HPC on K8s

🧮 AWS PCS: Parallel Computing Service

Get Started in Minutes
Three steps to launch your first distributed training job.

1. Deploy Infrastructure: Launch a cluster using our CloudFormation templates for HyperPod, ParallelCluster, or EKS.

2. Build Container: Use our Dockerfiles to build a training container with your framework of choice.

3. Launch Training: Submit your job with Slurm or Kubernetes using our ready-made launch scripts.
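To make the final step more concrete, here is a minimal sketch of what a Slurm submission for a multi-node PyTorch job could look like, written as a small Python helper that renders the batch script. It is an illustration only and is not taken from the repository's launch scripts: the helper name `render_sbatch`, the job name, the `train.py` entry point, and the rendezvous port 29500 are all assumptions, and the container step is left out to keep the example short.

```python
from textwrap import dedent


def render_sbatch(job_name: str, nodes: int, gpus_per_node: int, train_script: str) -> str:
    """Render a Slurm batch script that starts one torchrun launcher per node.

    All argument values are illustrative; adapt them to your own cluster.
    """
    if nodes < 1 or gpus_per_node < 1:
        raise ValueError("nodes and gpus_per_node must be positive")

    return dedent(f"""\
        #!/bin/bash
        #SBATCH --job-name={job_name}
        #SBATCH --nodes={nodes}
        #SBATCH --ntasks-per-node=1
        #SBATCH --gpus-per-node={gpus_per_node}

        # The first hostname in the allocation serves as the rendezvous endpoint.
        head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

        # One torchrun per node; torchrun spawns one worker per GPU.
        srun torchrun \\
            --nnodes={nodes} \\
            --nproc_per_node={gpus_per_node} \\
            --rdzv_id="$SLURM_JOB_ID" \\
            --rdzv_backend=c10d \\
            --rdzv_endpoint="$head_node:29500" \\
            {train_script}
        """)


if __name__ == "__main__":
    # Hypothetical 2-node, 8-GPU-per-node job; write the output to a file and submit it with sbatch.
    print(render_sbatch("demo-train", nodes=2, gpus_per_node=8, train_script="train.py"))
```

Generating the script from a function keeps the node and GPU counts in one place, so the `#SBATCH` directives and the `torchrun` arguments cannot drift apart.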
