Training Thousands of LoRA Adapters at Once
Published on Jun 2, 2026
TL;DR: We extended Miles with a multi-adapter LoRA training path that trains thousands of LoRA adapters concurrently and asynchronously. The core change is a modification to Megatron-Bridge and Miles that lets us load multiple LoRA adapters as a single matrix. On our Qwen3.6-35B-A3B + GSM8K stress test, we ran 1,536 LoRA adapter instances concurrently with a step time under 3 minutes.
Overview
LoRA-based post-training decomposes a model into two components: a shared base model and a lightweight low-rank adapter that captures task-specific updates. This setup works well when training a single policy, since only the adapter parameters need to be optimized. However, it underutilizes resources when many policies are fine-tuned in parallel. Each LoRA training run still requires deployment of the full base model alongside its adapter, causing the same underlying model weights to be replicated across concurrent runs. As the number of policies scales, this duplication introduces substantial wasted VRAM.
What if we could share the same base model between policies, and just fine-tune different LoRA adapters in a single batch? This is cleaner and improves scalability: we can keep one base model, route tokens to different LoRA adapters, and have the training/inference stack treat LoRA adapters as cheap concurrent policies rather than separate model replicas.
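The routing idea can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the Miles or Megatron-Bridge implementation: the function name `multi_lora_linear`, the `adapter_ids` vector, and the stacked `A`/`B` layout are all hypothetical. It shows one frozen base projection applied to every token, plus a low-rank delta selected per token.

```python
import numpy as np

def multi_lora_linear(x, adapter_ids, W, A, B, scale=1.0):
    """Shared base projection plus a per-token LoRA delta (illustrative only).

    x:           (tokens, d_in)          packed batch of token activations
    adapter_ids: (tokens,)               adapter slot that owns each token
    W:           (d_in, d_out)           frozen base weight, shared by all adapters
    A:           (n_adapters, d_in, r)   stacked LoRA down-projections
    B:           (n_adapters, r, d_out)  stacked LoRA up-projections
    """
    base = x @ W                                  # computed once for all tokens
    A_tok = A[adapter_ids]                        # gather: (tokens, d_in, r)
    B_tok = B[adapter_ids]                        # gather: (tokens, r, d_out)
    low = np.einsum("td,tdr->tr", x, A_tok)       # per-token down-projection
    delta = np.einsum("tr,tro->to", low, B_tok)   # per-token up-projection
    return base + scale * delta
```

Gathering `A[adapter_ids]` copies adapter weights per token, which is fine for a sketch; a production kernel would more likely group tokens by adapter. In standard LoRA, `scale` is typically alpha divided by the rank.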
We built on our prior work supporting LoRA RL for the Qwen3.5 model family, with the goal of extending our training stack from "one LoRA policy" to "many LoRA policies, in the same training step."
The motivation is simple: base models are large and LoRA adapters are small. If we want to run thousands of RL experiments (e.g., prompt/harness, reward design, and curriculum ablations), we can't replicate the full base model for every individual training run.
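A back-of-the-envelope comparison makes the gap concrete. The sketch below counts weight memory only (no optimizer state, activations, or KV cache). The adapter size of 50M parameters is a made-up illustrative figure, and the 35B base size is simply the nominal count from the model name. The helper names are hypothetical.

```python
def replicated_bytes(n_policies, base_params, adapter_params, bytes_per_param=2):
    """Weight memory when every policy carries its own copy of the base model."""
    return n_policies * (base_params + adapter_params) * bytes_per_param

def shared_bytes(n_policies, base_params, adapter_params, bytes_per_param=2):
    """Weight memory when one base model is shared and only adapters multiply."""
    return (base_params + n_policies * adapter_params) * bytes_per_param

# Illustrative inputs only: nominal 35B base, ASSUMED 50M-parameter adapter, 16-bit weights.
BASE_PARAMS = 35_000_000_000
ADAPTER_PARAMS = 50_000_000   # hypothetical, not a measured value
N_POLICIES = 1536

replicated = replicated_bytes(N_POLICIES, BASE_PARAMS, ADAPTER_PARAMS)
shared = shared_bytes(N_POLICIES, BASE_PARAMS, ADAPTER_PARAMS)
print(f"replicated: {replicated / 1e12:.1f} TB, shared: {shared / 1e9:.1f} GB")
```

Under these assumptions the replicated cost grows with the base model size times the number of policies, while the shared cost grows only with the adapter size.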
We built our multi-LoRA framework on top of Miles, RadixArk's continuously evolving open-source RL post-training framework. Miles already provides the pieces we need for large-scale RL, such as:
Megatron-based training with support for flexible modifications
SGLang-based rollout with support for scaling to thousands of LoRA adapters
Unified FP8 training support
We added multi-LoRA training by:
Deploying one shared Qwen3.6-35B-A3B base model
Supporting multiple LoRA adapter slots in Megatron-Bridge
Implementing multi-LoRA rollouts and training with Miles
Loading and unloading adapters online without restarting the RL trainer
Serving multiple LoRA adapters using SGLang's native multi-LoRA interface
Keeping experts adapter-free to unlock additional memory savings
LoRA with GRPO
GRPO-style RL objectives already separate the trainable policy from a frozen reference model. In the DeepSeekMath/GRPO objective, the policy update is regularized by a KL term against \(\pi_{\text{ref}}\):
\[\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left( \min(\cdots) - \beta D_{\text{KL}} \left[ \pi_\theta \Vert \pi_{\text{ref}} \right] \right) \right]\]
For LoRA policy \(a\), the model parameters are:
\[\theta_a = \theta_0 + \Delta_a\]
where \(\theta_0\) is the shared base model and \(\Delta_a\) is the adapter-specific LoRA delta. If every LoRA adapter starts from the same base model, then the correct frozen reference for every adapter is that same base model:
\[\pi_{\text{ref}, a} = \pi_{\theta_0}\]
The policy logprob depends on the adapter:
\[\log \pi_a(y_t \mid s_t) = \log \pi_{\theta_0 + \Delta_a}(y_t \mid s_t)\]
But the reference logprob does not:
\[\log \pi_{\text{ref}}(y_t \mid s_t) = \log \pi_{\theta_0}(y_t \mid s_t)\]
Using the common sampled KL estimator from GRPO implementations, define:
\[r_{a,t} = \frac{ \pi_{\text{ref}}(y_t \mid s_t) }{ \pi_a(y_t \mid s_t) } = \exp\left( \log \pi_{\text{ref}}(y_t \mid s_t) - \log \pi_a(y_t \mid s_t) \right)\]
Then the token-level KL penalty is:
\[\widehat{D}_{\text{KL}, a,t} = r_{a,t} - \log r_{a,t} - 1\]
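As a short check that is not part of the original derivation, and assuming the token \(y_t\) is sampled from \(\pi_a\) itself (rollouts from a stale policy would only approximate this), the ratio has expectation one:

\[\mathbb{E}_{y_t \sim \pi_a}\left[ r_{a,t} \right] = \sum_{y} \pi_a(y \mid s_t) \, \frac{\pi_{\text{ref}}(y \mid s_t)}{\pi_a(y \mid s_t)} = 1\]

so the estimator is unbiased for the KL divergence from the reference:

\[\mathbb{E}_{y_t \sim \pi_a}\left[ \widehat{D}_{\text{KL}, a,t} \right] = 1 - \mathbb{E}_{y_t \sim \pi_a}\left[ \log r_{a,t} \right] - 1 = \mathbb{E}_{y_t \sim \pi_a}\left[ \log \frac{\pi_a(y_t \mid s_t)}{\pi_{\text{ref}}(y_t \mid s_t)} \right] = D_{\text{KL}}\left[ \pi_a \Vert \pi_{\text{ref}} \right]\]

Each sample is also non-negative, since \(\log r \le r - 1\) for all \(r > 0\), with equality only at \(r = 1\).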
Only the policy logprob changes with adapter \(a\). The reference logprob is shared across all adapters for the same token state. Therefore, one frozen reference model can score the packed batch once, and each adapter's policy loss can reuse those reference logprobs.
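The reuse of reference logprobs can be sketched as follows. This is a minimal illustration, not the Miles implementation: the function names and the flat per-token lists are hypothetical, and it assumes the reference logprobs were produced by a single forward pass of the frozen base model over the packed batch.

```python
import math

def k3_penalty(ref_logprob, policy_logprob):
    """Token-level KL penalty r - log(r) - 1, with r = pi_ref / pi_a."""
    log_r = ref_logprob - policy_logprob
    return math.exp(log_r) - log_r - 1.0

def per_adapter_kl(ref_logprobs, policy_logprobs, adapter_ids):
    """Mean KL penalty per adapter over one packed batch.

    ref_logprobs:    one value per token, computed once by the shared reference
    policy_logprobs: one value per token, computed with that token's adapter
    adapter_ids:     adapter slot that owns each token
    """
    totals, counts = {}, {}
    for ref_lp, pol_lp, a in zip(ref_logprobs, policy_logprobs, adapter_ids):
        totals[a] = totals.get(a, 0.0) + k3_penalty(ref_lp, pol_lp)
        counts[a] = counts.get(a, 0) + 1
    return {a: totals[a] / counts[a] for a in totals}

# Toy packed batch: adapter 0 still matches the reference, adapter 1 has drifted.
ref = [-1.0, -2.0, -0.5, -1.5]
pol = [-1.0, -1.0, -0.5, -2.0]
ids = [0, 1, 0, 1]
kl = per_adapter_kl(ref, pol, ids)
```

The reference list is indexed only by token position, never by adapter, which is what lets a single reference pass serve every adapter in the batch.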
Crucially, this sharing only holds when all adapters use the same reference checkpoint. If adapter slots are warm-started from different base models or finetunes, the shared-reference argument breaks down and we are forced to use multiple reference models.
Multi-LoRA Implementation
Multiple Adapters in Megatron-Bridge
We began by adding a new multi-LoRA transform within Megatron-Bridge's PEFT module. The new transform, MultiLoRA, behaves similarly to a normal PEFT transform by attaching the LoRA weights and freezing the layer. However, instead of wrapping a target module with one adapter, it wraps it with N...