PithTrain – a compact, agent-native MoE training system



Jun 1, 2026

MLC Community

TL;DR. PithTrain is a compact, agent-native Mixture-of-Experts (MoE) training framework, in about 11K lines of Python. It trains as fast as mature production frameworks, and it is substantially cheaper for an AI coding agent to work with: on a suite of real training-system tasks, the same agent gets the job done with up to 62% fewer turns and 64% less GPU time than on production frameworks. We call this second axis agent-task efficiency, and as coding agents take on more of building and maintaining these systems, we think it deserves to sit alongside throughput as a metric worth optimizing.

GitHub: github.com/mlc-ai/pith-train

Paper: arxiv.org/abs/2605.31463

Why we built it

In just a couple of years, AI coding agents have gone from autocomplete to genuine collaborators. They fix bugs, ship features, review code, and operate infrastructure, and they are increasingly trusted with serious systems work that once demanded deep, specialized expertise. The shift is real, and it is accelerating. Some of that work is building and evolving the systems that train large models. Mixture-of-Experts (MoE) is now the dominant architecture for frontier models, and the frameworks that train them are remarkable pieces of engineering, refined over years to deliver broad model coverage, peak throughput, and support across many hardware platforms. But they were built for a specific audience: expert human engineers. At the time that work was done, an AI agent reading and modifying the code simply wasn’t part of the picture.

In particular, an agent reads a codebase differently than a person does. The very patterns that serve a human expert can work against an agent that operates turn by turn through a fixed set of tools. One layer skeleton reused across many models means more files to trace before the agent can tell what actually runs at a given call site. Peak-performance kernels written in compiled extensions introduce a language boundary, where an error surfaces with no Python line to anchor on and any change forces a rebuild. None of this is a flaw in those frameworks. Designing for an agent simply was not a goal anyone had yet, and what such a design should look like is still an open question.

People hit the same wall. Learning how MoE training works under the hood, or extending one of these systems, means navigating the same scale and indirection that slow an agent down. A codebase small enough to read end to end is easier for a person to learn from and build on, and that was part of our motivation from the start.

There is also a gap in how we measure progress. When we evaluate a training framework, we report training throughput, such as tokens per second and MFU, and stop there. The cost of understanding, operating, and extending the system stays invisible, even as agents take on more and more of that work.
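For readers less familiar with the throughput metrics named above, here is a minimal sketch of how tokens per second and MFU (model FLOPs utilization) are commonly estimated. It is not taken from PithTrain. It assumes the usual approximation of about 6 FLOPs per active parameter per trained token (forward plus backward), ignores attention FLOPs, and counts only the parameters a token actually activates, which is the natural choice for an MoE model. The function names and the numbers in the usage note are hypothetical.

```python
def tokens_per_second(tokens_per_step: int, step_time_s: float) -> float:
    """Training throughput: tokens processed per wall-clock second."""
    return tokens_per_step / step_time_s


def mfu(active_params: float, tokens_per_s: float,
        num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs utilization: achieved model FLOP/s over peak hardware FLOP/s.

    Assumption: ~6 FLOPs per active parameter per token (2 for the forward
    pass, 4 for the backward pass). Attention FLOPs are ignored, so this
    slightly underestimates the true figure for long sequences.
    """
    achieved_flops_per_s = 6.0 * active_params * tokens_per_s
    peak_flops_per_s = num_gpus * peak_flops_per_gpu
    return achieved_flops_per_s / peak_flops_per_s
```

As an illustration with made-up inputs, `mfu(3e9, 160_000, 8, 989e12)` describes a model with 3B active parameters training at 160K tokens per second on 8 GPUs, each with an assumed peak of 989 TFLOP/s. The result is a fraction between 0 and 1, and neither this number nor tokens per second says anything about how costly the system is to understand or modify.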

Figure 1: The dual-efficiency design.

So we asked a simple question: can an MoE training system be cheap for an agent to work on without giving up production-grade speed? PithTrain is our attempt at a yes. It is built for dual efficiency: strong training throughput together with high agent-task efficiency, the cost of using a coding agent to understand, operate, and extend the system. We make that cost concrete and measure it directly: how long a session runs, how much GPU time it consumes, how many back-and-forth turns it takes, how much the agent reads each turn, and how much it writes.
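One way to picture these five measurements is as a per-session record, plus a helper for comparing two frameworks on the same task. The sketch below is illustrative only. The class, its field names, and the choice of tokens as the unit for reading and writing are assumptions, not PithTrain's actual instrumentation, and the numbers in the usage note are invented.

```python
from dataclasses import dataclass


@dataclass
class AgentSession:
    """Hypothetical record of one agent session on one task."""
    wall_clock_s: float   # how long the session runs
    gpu_seconds: float    # GPU time consumed by the session
    turns: int            # back-and-forth turns taken
    tokens_read: int      # total tokens the agent read (assumed unit)
    tokens_written: int   # total tokens the agent wrote (assumed unit)

    @property
    def tokens_read_per_turn(self) -> float:
        """Average amount the agent reads each turn."""
        return self.tokens_read / self.turns


def percent_reduction(baseline: float, candidate: float) -> float:
    """Relative saving of `candidate` over `baseline`, in percent."""
    return 100.0 * (baseline - candidate) / baseline
```

For example, if the same task takes 80 turns on one framework and 60 on another, `percent_reduction(80, 60)` reports a 25% reduction in turns. The same helper applies unchanged to GPU seconds or wall-clock time.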

What PithTrain is

PithTrain is an end-to-end MoE training system: give it a tokenized corpus and it handles the rest, from distributed setup through to HuggingFace-compatible checkpoints. It trains models like Qwen3-MoE and GPT-OSS on NVIDIA Hopper and Blackwell GPUs, in BF16 or FP8, and scales across four kinds of parallelism: pipeline, data (FSDP), context, and expert. For pipeline parallelism it uses DualPipeV, an overlapped schedule that hides expert-parallel communication behind compute.

The whole thing is organized in three layers: an application layer (the training loop), an engine (the DualPipeV scheduler, optimizer, and checkpointing), and an operator layer (a few custom Triton kernels). It is about 11K lines in total.

Figure 2: PithTrain's architecture.

What makes it agent-native

PithTrain is built on four design principles. None of them is novel on its own; what’s new is treating them as primary constraints for a training system and measuring what they buy you.

1. Keep it compact. PithTrain covers exactly what a distributed MoE training system needs in about 11K lines, versus well north of 150K for production cores. Less code means less to search, fewer cross-file dependencies to track, and less to read before you’re sure a change is complete. It also means that with today’s 200K–1M-token context windows, an agent can hold the entire framework in one pass...
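A back-of-the-envelope check makes the context-window point concrete. The sketch below is a rough estimate, not a measurement. The figure of 10 tokens per line of Python and the 25% of the window reserved for instructions and the agent's own output are both assumptions chosen for illustration, and real tokenizers and codebases will vary.

```python
def estimate_tokens(lines_of_code: int, tokens_per_line: float = 10.0) -> int:
    """Rough token count for a codebase (tokens_per_line is an assumption)."""
    return round(lines_of_code * tokens_per_line)


def fits_in_context(lines_of_code: int, context_window_tokens: int,
                    tokens_per_line: float = 10.0,
                    reserve_fraction: float = 0.25) -> bool:
    """True if the whole codebase fits in the window after reserving
    `reserve_fraction` of it for prompts, tool output, and responses."""
    budget = context_window_tokens * (1.0 - reserve_fraction)
    return estimate_tokens(lines_of_code, tokens_per_line) <= budget
```

Under these assumptions, an 11K-line codebase comes to roughly 110K tokens and passes `fits_in_context(11_000, 200_000)`, while a 150K-line core comes to roughly 1.5M tokens and fails `fits_in_context(150_000, 1_000_000)`.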
